Improving the performance of Hadoop Hive by sharing scan and computation tasks

Tansel Dokeroglu; Serkan Ozal; Murat Ali Bayir; Muhammet Serkan Cinar; Ahmet Cosar

Journal Article (Open Access)

Journal of Cloud Computing (2014) 3(1) 1-11

DOI: 10.1186/s13677-014-0012-6

25 Citations

31 Readers

Abstract

MapReduce is a popular programming model for executing time-consuming analytical queries as a batch of tasks on large-scale data clusters. In environments where multiple queries with similar selection predicates, common tables, and join tasks arrive simultaneously, many opportunities can arise for sharing scan and/or join computation tasks. Executing common tasks only once can substantially reduce the total execution time of a batch of queries. In this study, we propose a Multiple Query Optimization framework, SharedHive, to improve the overall performance of Hadoop Hive, an open-source SQL-based data warehouse using MapReduce. SharedHive transforms a set of correlated HiveQL queries into a new set of insert queries that produce all of the required outputs within a shorter execution time. Experiments show that SharedHive achieves significant reductions in the total execution times of TPC-H queries.
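The scan-sharing idea in the abstract can be illustrated with a minimal sketch. It is not SharedHive's actual algorithm, which the abstract does not detail. It assumes that each query reads a single table and that queries are correlated simply by naming the same source table. The sketch groups such queries and emits one Hive multi-insert statement per table, so the table is scanned once for all outputs. The `SimpleQuery` type, the `merge_by_source` function, and the output table names are hypothetical; the `lineitem` and `orders` tables and their columns follow the TPC-H schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleQuery:
    """Hypothetical single-table query: SELECT select_list FROM source_table [WHERE predicate]."""
    output_table: str
    select_list: str
    source_table: str
    predicate: str = ""


def merge_by_source(queries):
    """Group queries by source table and build one multi-insert statement per group.

    Simplifying assumption: two queries are "correlated" only if they read
    the same table. Joins and shared sub-expressions are not handled.
    """
    groups = defaultdict(list)
    for query in queries:
        groups[query.source_table].append(query)

    statements = []
    for source, group in groups.items():
        # Hive multi-insert form: one FROM clause followed by several INSERT clauses.
        lines = [f"FROM {source}"]
        for query in group:
            clause = (
                f"INSERT OVERWRITE TABLE {query.output_table} "
                f"SELECT {query.select_list}"
            )
            if query.predicate:
                clause += f" WHERE {query.predicate}"
            lines.append(clause)
        statements.append("\n".join(lines))
    return statements


# Illustrative batch: two queries share the lineitem scan, one reads orders.
batch = [
    SimpleQuery("out_a", "l_orderkey, l_extendedprice", "lineitem", "l_quantity > 30"),
    SimpleQuery("out_b", "COUNT(*)", "lineitem", "l_shipdate <= '1998-09-01'"),
    SimpleQuery("out_c", "o_orderkey", "orders"),
]

for statement in merge_by_source(batch):
    print(statement)
    print()
```

In this sketch the two `lineitem` queries collapse into a single statement with one `FROM lineitem` clause and two `INSERT OVERWRITE TABLE` clauses, while the `orders` query stays on its own. A real optimizer would also have to decide when merging is profitable and how to share join computation, which this example does not attempt.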

Cite

Dokeroglu, T., Ozal, S., Bayir, M. A., Cinar, M. S., & Cosar, A. (2014). Improving the performance of Hadoop Hive by sharing scan and computation tasks. Journal of Cloud Computing, 3(1), 1–11. https://doi.org/10.1186/s13677-014-0012-6
