Set Similarity Joins with Complex Expressions on Distributed Platforms

Abstract

A set similarity join finds all similar pairs from a collection of sets. This operation is essential for many important tasks in Big Data analytics, including string data integration and cleaning. The vast majority of set similarity join algorithms proposed so far consider string data represented by a single set over which a simple similarity predicate is defined. However, real data is typically multi-attribute and thus better represented by multiple sets. Such a representation requires complex expressions to capture a given notion of similarity. Moreover, similarity join processing under this new formulation is more expensive, which calls for distributed algorithms to deal with large datasets. In this paper, we present a distributed algorithm for set similarity joins with complex similarity expressions. Our approach supports complex Boolean expressions over multiple predicates. We propose a simple but effective data partitioning strategy to reduce both communication and computation costs. We have implemented our algorithm in Spark, a popular distributed data processing engine. Experimental results show that the proposed approach is efficient and scalable.
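To make the notion of a complex similarity expression concrete, the following is a minimal single-machine sketch, not the distributed algorithm of the paper. It assumes Jaccard similarity over whitespace tokens (the abstract does not name a similarity function), and the attribute names, thresholds, and records are hypothetical. Each record is represented by several sets, and two records match when a Boolean expression over per-attribute predicates holds.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def tokens(text):
    """Lowercased whitespace tokens as a set (an assumed tokenization)."""
    return set(text.lower().split())


def similar(r, s):
    """Hypothetical complex expression over three attributes:
    (name >= 0.6 AND address >= 0.5) OR phone >= 0.9
    """
    return (
        jaccard(r["name"], s["name"]) >= 0.6
        and jaccard(r["address"], s["address"]) >= 0.5
    ) or jaccard(r["phone"], s["phone"]) >= 0.9


def naive_join(records):
    """Compare every pair of records (quadratic; for illustration only)."""
    return [
        (i, j)
        for (i, r), (j, s) in combinations(enumerate(records), 2)
        if similar(r, s)
    ]


def make_record(name, address, phone):
    """Represent one multi-attribute record as one token set per attribute."""
    return {"name": tokens(name), "address": tokens(address), "phone": tokens(phone)}


# Toy data, invented for this example.
records = [
    make_record("john a smith", "12 main street springfield", "555 0101"),
    make_record("john smith", "12 main st springfield", "555 0199"),
    make_record("mary jones", "7 oak avenue shelbyville", "555 0142"),
    make_record("m jones", "po box 9", "555 0142"),
]

pairs = naive_join(records)
print(pairs)  # [(0, 1), (2, 3)]
```

Records 0 and 1 match through the conjunction on name and address, while records 2 and 3 match only through the phone predicate. The all-pairs comparison here is exactly the cost that motivates partitioning and distribution for large datasets.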

Cite

CITATION STYLE

APA

do Carmo Oliveira, D. J., Borges, F. F., Ribeiro, L. A., & Cuzzocrea, A. (2018). Set Similarity Joins with Complex Expressions on Distributed Platforms. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11019 LNCS, pp. 216–230). Springer Verlag. https://doi.org/10.1007/978-3-319-98398-1_15
