The association-rule-based recommendation is widespread in many big data applications which need quick response to improve user experience. Spark is a widely used distributed computing platform, which accelerates the processing of large-scale distributed data. Developing appropriate distributed algorithm for Spark is essential to decrease the processing time of distributed recommendation. The existing FP-Growth in Spark is a popular parallel recommendation method but getting the best performance only when the memory of machines can accommodate all immediate Resilient Distributed DataSets (RDDs). However, memory of many practice data centers is still not large enough for large data sets. Therefore, in this paper, a caching-based parallel FP-Growth is proposed which consists of an integer-based sorting and an RDD-caching strategy to improve the efficiency. Experimental results show that the proposal decreases the execution time by 32.37% on average compared with the existing parallel FP-Growth in Spark. Furthermore, impacts of some important parameters upon the performance of the proposal are analyzed by numerous realistic experiments in Spark.
CITATION STYLE
Cai, Z., Zhu, X., Zheng, Y., Liu, D., & Xu, L. (2018). A caching-based parallel FP-growth in apache spark. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11336 LNCS, pp. 519–533). Springer Verlag. https://doi.org/10.1007/978-3-030-05057-3_39
Mendeley helps you to discover research relevant for your work.