Combining powerful parallel frameworks and on-demand commodity hardware, cloud computing has made both analytics and decision support systems canonical to enterprises of all sizes. Given the unprecedented volumes of data that such companies accumulate, filtering and retrieving this data are pressing challenges. The data is often organized in star schemas, in which Star Joins are ubiquitous and expensive operations. In particular, excessive disk spill and network communication are major bottlenecks for all current MapReduce or Spark solutions. Here, we propose two efficient solutions that reduce computation time by at least 60%: the Spark Bloom-Filtered Cascade Join (SBFCJ) and the Spark Broadcast Join (SBJ). Conversely, a direct Spark implementation of a sequence of joins performs poorly, showcasing the importance of further filtering for minimal disk spill and network communication. Finally, while SBJ is twice as fast when memory per executor is large enough, SBFCJ is remarkably resilient to low-memory scenarios. Both algorithms are very competitive solutions to Star Joins in the cloud.
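To make the two strategies concrete, the following is a minimal single-machine sketch in plain Python. It is not the authors' Spark implementation. The table layout, the column names (`date_fk`, `part_fk`, `revenue`), the toy data, and all helper names are hypothetical and chosen only for illustration. The first function mimics the idea behind SBFCJ: build a compact Bloom filter over the keys of each filtered dimension, use the filters to discard fact rows early, then run a cascade of exact joins. The second mimics the idea behind SBJ: hold the filtered dimensions in memory as hash maps and join the fact table in a single pass.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def filter_dimension(rows, predicate):
    """Return {primary_key: attributes} for dimension rows passing the predicate."""
    return {pk: attrs for pk, attrs in rows if predicate(attrs)}


def bloom_filtered_cascade_join(fact, dims):
    """fact: list of dicts; dims: {foreign_key_column: filtered dimension dict}."""
    filters = {}
    for fk, dim in dims.items():
        bf = BloomFilter()
        for pk in dim:
            bf.add(pk)
        filters[fk] = bf
    # Step 1: discard fact rows early using only the compact filters.
    candidates = [r for r in fact if all(r[fk] in bf for fk, bf in filters.items())]
    # Step 2: cascade of exact joins; this also removes Bloom false positives.
    for fk, dim in dims.items():
        candidates = [{**r, **dim[r[fk]]} for r in candidates if r[fk] in dim]
    return candidates


def broadcast_join(fact, dims):
    """Join in one pass, assuming every filtered dimension fits in memory."""
    results = []
    for r in fact:
        if all(r[fk] in dim for fk, dim in dims.items()):
            joined = dict(r)
            for fk, dim in dims.items():
                joined.update(dim[r[fk]])
            results.append(joined)
    return results


# Hypothetical toy star schema: one fact table and two dimensions.
date_dim = [(1, {"year": 1997}), (2, {"year": 1998})]
part_dim = [(10, {"category": "MFGR#12"}), (11, {"category": "MFGR#13"})]
fact = [
    {"date_fk": 1, "part_fk": 10, "revenue": 100},
    {"date_fk": 1, "part_fk": 11, "revenue": 200},
    {"date_fk": 2, "part_fk": 10, "revenue": 300},
]
dims = {
    "date_fk": filter_dimension(date_dim, lambda a: a["year"] == 1997),
    "part_fk": filter_dimension(part_dim, lambda a: a["category"] == "MFGR#12"),
}

sbfcj_rows = bloom_filtered_cascade_join(fact, dims)
sbj_rows = broadcast_join(fact, dims)
```

Both functions return the same rows. The sketch only illustrates the trade-off: the Bloom filters are much smaller than the dimension tables they summarize, which suits low-memory settings, whereas the broadcast variant avoids the extra join passes but needs the filtered dimensions to fit in memory.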
Brito, J. J., Mosqueiro, T., Ciferri, R. R., & De Aguiar Ciferri, C. D. (2016). Faster cloud Star Joins with reduced disk spill and network communication. In Procedia Computer Science (Vol. 80, pp. 74–85). Elsevier B.V. https://doi.org/10.1016/j.procs.2016.05.299