Similarity Grouping in Big Data Systems

Yasin N. Silva; Manuel Sandoval; Diana Prado; Xavier Wallace; Chuitian Rong

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 11807 LNCS, pp. 212–220 (2019)

DOI: 10.1007/978-3-030-32047-8_19

Abstract

Distributed computing technologies have opened the door for a wide range of organizations to analyze massive amounts of data. Grouping (fast but based on exact semantics) and clustering (relatively slow but based on similarity-aware semantics) are among the most useful data analysis operations. Previous work introduced the Similarity Grouping (SG) operator, which aims to integrate the best features of grouping and clustering, i.e., fast execution times and similarity-aware grouping semantics. The SG operators, however, were proposed for single-node relational database systems. This paper introduces the Distributed Similarity Grouping (DSG) operator, a highly parallel operator for identifying similarity groups in big datasets. DSG enables the identification of groups in which all elements are within a given distance threshold of each other. This paper presents DSG’s design details, implementation guidelines for Spark and Hadoop (two important Big Data systems), and an extensive performance and scalability evaluation.
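The group semantics described in the abstract (every pair of elements in a group lies within a given threshold) can be illustrated with a small single-node sketch. This is not the DSG algorithm from the paper, whose partitioning and parallel design are not described here; it is a simple greedy pass, and the function name `similarity_groups`, the parameter `eps`, and the sample points are all hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def similarity_groups(points, eps):
    """Greedy illustration of all-pairs-within-eps groups (not the paper's DSG).

    A point joins the first existing group whose members are ALL within
    eps of it; otherwise it starts a new group.
    """
    groups = []
    for p in points:
        for g in groups:
            if all(dist(p, q) <= eps for q in g):
                g.append(p)
                break
        else:
            # No existing group accepts p, so it seeds a new one.
            groups.append([p])
    return groups


# Hypothetical sample data.
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0), (5.2, 5.1)]
groups = similarity_groups(points, eps=0.6)
print(groups)
```

In this sample, (1.0, 0.0) is within the threshold of (0.5, 0.0) but not of (0.0, 0.0), so it does not join the first group. That is the difference between requiring all pairs to be close and merely chaining neighbors together.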

Cite

Silva, Y. N., Sandoval, M., Prado, D., Wallace, X., & Rong, C. (2019). Similarity Grouping in Big Data Systems. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11807 LNCS, pp. 212–220). Springer. https://doi.org/10.1007/978-3-030-32047-8_19
