Set Similarity Joins with Complex Expressions on Distributed Platforms

Diego Junior do Carmo Oliveira; Felipe Ferreira Borges; Leonardo Andrade Ribeiro; Alfredo Cuzzocrea

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2018), Vol. 11019 LNCS, pp. 216–230

DOI: 10.1007/978-3-319-98398-1_15

Abstract

A set similarity join finds all similar pairs in a collection of sets. This operation is essential for many important tasks in Big Data analytics, including string data integration and cleaning. The vast majority of set similarity join algorithms proposed so far consider string data represented by a single set over which a simple similarity predicate is defined. However, real data is typically multi-attribute and thus better represented by multiple sets. Such a representation requires complex expressions to capture a given notion of similarity. Moreover, similarity join processing under this new formulation is clearly more expensive, which calls for distributed algorithms to deal with large datasets. In this paper, we present a distributed algorithm for set similarity joins with complex similarity expressions. Our approach supports complex Boolean expressions over multiple predicates. We propose a simple but effective data partitioning strategy to reduce both communication and computation costs. We have implemented our algorithm in Spark, a popular distributed data processing engine. Experimental results show that the proposed approach is efficient and scalable.
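
To make the problem statement concrete, the following is a minimal single-machine sketch of a set similarity join over multi-attribute records. It is not the distributed algorithm or the partitioning strategy from the paper: the record layout, the choice of Jaccard similarity, the thresholds, and the names `jaccard`, `tokens`, `similar`, and `naive_join` are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; treated as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def tokens(text):
    """Represent a string attribute as a set of lowercase word tokens."""
    return set(text.lower().split())


# Hypothetical multi-attribute records: (id, title, authors).
records = [
    (1, "set similarity joins on spark", "ribeiro oliveira"),
    (2, "set similarity joins on apache spark", "oliveira ribeiro"),
    (3, "graph processing on spark", "smith jones"),
]


def similar(r, s, t_title=0.8, t_authors=0.8):
    """Example Boolean expression: a conjunction of two per-attribute predicates.

    Other expressions (disjunctions, negations, more attributes) could be
    composed the same way; this particular one is an illustrative assumption.
    """
    title_ok = jaccard(tokens(r[1]), tokens(s[1])) >= t_title
    authors_ok = jaccard(tokens(r[2]), tokens(s[2])) >= t_authors
    return title_ok and authors_ok


def naive_join(recs):
    """Compare every pair of records: quadratic, with no filtering or partitioning."""
    return [(r[0], s[0]) for r, s in combinations(recs, 2) if similar(r, s)]


print(naive_join(records))
```

Because every pair of records is compared on every attribute, the cost of this naive version grows quadratically with the collection size, which illustrates why the abstract argues for distributed processing and for reducing communication and computation costs.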

Cite

APA

do Carmo Oliveira, D. J., Borges, F. F., Ribeiro, L. A., & Cuzzocrea, A. (2018). Set Similarity Joins with Complex Expressions on Distributed Platforms. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11019 LNCS, pp. 216–230). Springer Verlag. https://doi.org/10.1007/978-3-319-98398-1_15
