Distributed SQL Query Engines (DSQEs) are increasingly used in a variety of domains, but users, especially those in small companies with little expertise, may face the challenge of selecting an appropriate engine for their specific applications. Although both industry and academia are attempting to come up with high-level benchmarks, the performance of DSQEs has never been explored or compared in depth. We propose an empirical method for evaluating the performance of DSQEs with representative metrics, datasets, and system configurations. We implement a micro-benchmarking suite of three classes of SQL queries for both a synthetic and a real-world dataset, and we report response time, resource utilization, and scalability. We use our micro-benchmarking suite to analyze and compare three state-of-the-art engines, viz. Shark, Impala, and Hive. We gain valuable insights for each engine and we present a comprehensive comparison of these DSQEs. We find that different query engines have widely varying performance: Hive is always outperformed by the other engines, but whether Impala or Shark is the best performer depends highly on the query type.
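The abstract's core metric is query response time measured over repeated runs. A minimal sketch of such a measurement harness is shown below; it uses Python's built-in `sqlite3` purely as a stand-in engine (the paper benchmarks Shark, Impala, and Hive on cluster deployments), and the table name, query, and run count are illustrative assumptions, not taken from the paper.

```python
import sqlite3
import statistics
import time

def time_query(conn, sql, runs=5):
    """Execute a query several times and report response-time statistics (seconds)."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(sql).fetchall()  # drain the cursor so full execution is timed
        samples.append(time.perf_counter() - start)
    return {
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
    }

# Toy in-memory dataset; a real benchmark would target a distributed engine
# over a synthetic or real-world dataset, as the paper does.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rankings (url TEXT, pagerank INTEGER)")
conn.executemany(
    "INSERT INTO rankings VALUES (?, ?)",
    [("url%d" % i, i % 100) for i in range(10_000)],
)

stats = time_query(conn, "SELECT pagerank, COUNT(*) FROM rankings GROUP BY pagerank")
print(sorted(stats))
```

Reporting the median (rather than the mean) of several runs is a common way to damp warm-up and caching effects when comparing engines on the same query class.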
CITATION STYLE
Van Wouw, S., Viña, J., Iosup, A., & Epema, D. (2015). An empirical performance evaluation of distributed SQL query engines. In ICPE 2015 - Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering (pp. 123–131). Association for Computing Machinery, Inc. https://doi.org/10.1145/2668930.2688053