Correlation Sketches for Approximate Join-Correlation Queries

Abstract

The increasing availability of structured datasets, from Web tables and open-data portals to enterprise data, opens up opportunities to enrich analytics and improve machine learning models through relational data augmentation. In this paper, we introduce a new class of data augmentation queries: join-correlation queries. Given a column Q and a join column K_Q from a query table T_Q, retrieve tables T_X in a dataset collection such that T_X is joinable with T_Q on K_Q and there is a column C ∈ T_X such that Q is correlated with C. A naïve approach to evaluate these queries, which first finds joinable tables and then explicitly joins and computes correlations between Q and all columns of the discovered tables, is prohibitively expensive. To efficiently support correlated column discovery, we 1) propose a sketching method that enables the construction of an index for a large number of tables and that provides accurate estimates for join-correlation queries, and 2) explore different scoring strategies that effectively rank the query results based on how well the columns are correlated with the query. We carry out a detailed experimental evaluation, using both synthetic and real data, which shows that our sketches attain high accuracy and the scoring strategies lead to high-quality rankings.
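
To make the query definition concrete, the following is a minimal Python sketch, not the paper's implementation. It contrasts the naïve evaluation (join on the key, then compute a correlation) with one plausible sketch-based estimate. Several things here are assumptions for illustration: each column is modeled as a dict from a unique join key to a numeric value, Pearson's coefficient is used as the correlation measure, and the sketch keeps the entries whose keys have the smallest hash values so that two independently built sketches tend to retain the same keys. The abstract does not specify the sketch construction, and all function names are hypothetical.

```python
import hashlib
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; NaN if undefined."""
    n = len(xs)
    if n < 2:
        return float("nan")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def exact_join_correlation(query, candidate):
    """Naive evaluation: join the two columns on their keys, then correlate.

    Assumption: `query` and `candidate` map unique join keys to numbers.
    """
    shared = sorted(query.keys() & candidate.keys())
    return pearson([query[k] for k in shared], [candidate[k] for k in shared])


def key_hash(key):
    """Deterministic 64-bit hash of a join key (hypothetical helper)."""
    digest = hashlib.sha1(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_sketch(column, size):
    """Keep the `size` entries whose keys have the smallest hash values.

    The sketch maps hashed key -> value. Using the same hash function for
    every column is what makes separately built sketches joinable.
    """
    hashed = sorted(((key_hash(k), v) for k, v in column.items()),
                    key=lambda pair: pair[0])
    return dict(hashed[:size])


def estimate_join_correlation(sketch_q, sketch_c):
    """Join two sketches on hashed keys and correlate the paired values."""
    shared = sorted(sketch_q.keys() & sketch_c.keys())
    return pearson([sketch_q[h] for h in shared],
                   [sketch_c[h] for h in shared])
```

In this setup a sketch would be built once per column when a table is indexed, and a query would touch only the fixed-size sketches rather than the full tables. The estimate is computed from the keys that survive in both sketches, so its quality depends on how many such keys there are.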

Cite

Santos, A., Bessa, A., Chirigati, F., Musco, C., & Freire, J. (2021). Correlation Sketches for Approximate Join-Correlation Queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 1531–1544). Association for Computing Machinery. https://doi.org/10.1145/3448016.3458456