DS-prox: Dataset proximity mining for governing the data lake

Abstract

With the arrival of Data Lakes (DL), there is an increasing need for efficient dataset classification to support data analysis and information retrieval. Our goal is to use meta-features describing datasets to detect whether they are similar. We utilise a novel proximity mining approach to assess the similarity of datasets. The proximity scores are used as an efficient first step, where pairs of datasets with high proximity are selected for further time-consuming schema matching and deduplication. The proposed approach helps prune unnecessary computations early, thus improving the efficiency of similar-schema search. We evaluate our approach in experiments using the OpenML online DL, which show significant efficiency gains above 25% compared to matching without early pruning, and recall rates exceeding 90% under certain scenarios.
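
The pruning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' proximity model: the three meta-features (attribute count, share of numeric attributes, attribute names), the equal-weight averaging, the 0.6 threshold, and the function names `meta_features`, `proximity`, and `candidate_pairs` are all assumptions made for illustration. The sketch only shows the general shape of the workflow: compute a cheap profile per dataset, score every pair, and pass only high-proximity pairs on to expensive schema matching.

```python
from itertools import combinations


def meta_features(columns):
    """Summarise a dataset given as {attribute_name: list_of_values}.

    These three features are illustrative assumptions, not the paper's set.
    """
    n_attrs = len(columns)
    # Count attributes whose values are all ints or floats.
    n_numeric = sum(
        all(isinstance(v, (int, float)) for v in values)
        for values in columns.values()
    )
    return {
        "n_attrs": n_attrs,
        "numeric_ratio": n_numeric / n_attrs if n_attrs else 0.0,
        "names": frozenset(name.lower() for name in columns),
    }


def proximity(a, b):
    """Return a score in [0, 1]; higher means the profiles look more alike."""
    # Relative agreement on the number of attributes.
    larger = max(a["n_attrs"], b["n_attrs"])
    size_sim = min(a["n_attrs"], b["n_attrs"]) / larger if larger else 1.0
    # Agreement on the share of numeric attributes.
    type_sim = 1.0 - abs(a["numeric_ratio"] - b["numeric_ratio"])
    # Jaccard overlap of attribute names.
    union = a["names"] | b["names"]
    name_sim = len(a["names"] & b["names"]) / len(union) if union else 1.0
    # Equal weights are an arbitrary choice for this sketch.
    return (size_sim + type_sim + name_sim) / 3.0


def candidate_pairs(datasets, threshold=0.6):
    """Keep only pairs whose proximity reaches the threshold (early pruning)."""
    profiles = {name: meta_features(cols) for name, cols in datasets.items()}
    kept = []
    for left, right in combinations(sorted(profiles), 2):
        score = proximity(profiles[left], profiles[right])
        if score >= threshold:
            kept.append((left, right, score))
    return kept


# Toy data lake: two near-identical schemas and one unrelated dataset.
datasets = {
    "iris_a": {"sepal_length": [5.1, 4.9], "species": ["setosa", "setosa"]},
    "iris_b": {"sepal_length": [6.2, 5.9], "species": ["virginica", "versicolor"]},
    "tweets": {"text": ["hi", "yo"], "lang": ["en", "en"], "user": ["a", "b"]},
}
pairs = candidate_pairs(datasets)
print(pairs)
```

In this toy example only the `iris_a`/`iris_b` pair survives, so the two pairs involving `tweets` would never reach the costly schema-matching step. The threshold controls the trade-off the abstract reports: raising it prunes more pairs but risks lowering recall.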

Cite

Alserafi, A., Calders, T., Abelló, A., & Romero, O. (2017). DS-prox: Dataset proximity mining for governing the data lake. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10609 LNCS, pp. 284–299). Springer Verlag. https://doi.org/10.1007/978-3-319-68474-1_20
