Discovery of unique column combinations with Hadoop


Abstract

Unique column combinations are an important kind of structural information in relations. From a data management perspective, discovering them is a crucial step in understanding and using the data, and it benefits data modeling, data integration, anomaly detection, query optimization, and indexing. However, discovering all unique column combinations is an NP-hard problem, so efficiency is a major challenge. In this paper, we propose MRUCC, an efficient algorithm for discovering unique column combinations in large-scale data sets on Hadoop. Existing algorithms mainly target data sets of normal size and cannot be adapted to large data sets. In contrast, MRUCC discovers unique column combinations in parallel and is implemented on Hadoop. Furthermore, we use column-based and row-based pruning to improve efficiency. Finally, we compare MRUCC with state-of-the-art approaches on both real and synthetic data sets. The experiments show that MRUCC performs better. © 2014 Springer International Publishing Switzerland.
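To make the central notion concrete, the sketch below shows what a unique column combination is and how a simple column-based pruning rule can cut the search space. It is a naive single-machine, level-wise search, not the MRUCC algorithm: the abstract gives no detail on MRUCC's MapReduce design or its row-based pruning, so neither is shown. The function names `is_unique` and `minimal_uccs` and the sample rows are assumptions made for illustration only.

```python
from itertools import combinations


def is_unique(rows, cols):
    """Return True if no two rows agree on all columns in `cols`."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in cols)
        if key in seen:
            return False
        seen.add(key)
    return True


def minimal_uccs(rows, num_cols):
    """Level-wise search for minimal unique column combinations.

    Illustrative only: candidates are checked from small to large, and any
    candidate that contains an already-found unique combination is skipped,
    because it is unique by definition but not minimal.
    """
    found = []
    for size in range(1, num_cols + 1):
        for cols in combinations(range(num_cols), size):
            # Column-based pruning (assumed form): skip supersets of known UCCs.
            if any(set(u) <= set(cols) for u in found):
                continue
            if is_unique(rows, cols):
                found.append(cols)
    return found


# Hypothetical toy relation with columns (name, city, age).
rows = [
    ("alice", "berlin", 30),
    ("bob", "berlin", 25),
    ("alice", "paris", 25),
]
print(minimal_uccs(rows, 3))
```

In this toy relation no single column is unique, but every pair of columns is, so the three-column combination is pruned without being checked. The number of candidates still grows exponentially with the number of columns, which is why pruning and parallel execution matter on large data sets.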

Cite

Han, S., Cai, X., Wang, C., Zhang, H., & Wen, Y. (2014). Discovery of unique column combinations with Hadoop. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8709 LNCS, pp. 533–541). Springer Verlag. https://doi.org/10.1007/978-3-319-11116-2_49
