Discovery of unique column combinations with Hadoop

Shupeng Han; Xiangrui Cai; Chao Wang; Haiwei Zhang; Yanlong Wen

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 8709 LNCS, pp. 533-541 (2014)

DOI: 10.1007/978-3-319-11116-2_49

Abstract

Unique column combinations are an important kind of structural information in relations. From a data management perspective, discovering them is a crucial step in understanding and using the data, and it benefits data modeling, data integration, anomaly detection, query optimization, and indexing. However, discovering all unique column combinations is an NP-hard problem, so efficiency is a major challenge. In this paper, we propose MRUCC, an efficient algorithm for discovering unique column combinations in large-scale data sets on Hadoop. Existing algorithms mainly target data sets of ordinary size and do not adapt well to large data sets. In contrast, MRUCC discovers unique column combinations in parallel and is implemented on Hadoop. Furthermore, it uses column-based and row-based pruning to improve efficiency. Finally, we compare MRUCC with state-of-the-art approaches on both real and synthetic data sets. The experiments show that MRUCC performs better. © 2014 Springer International Publishing Switzerland.
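To make the notion concrete, the following is a minimal single-machine sketch of what a unique column combination is and how column-side pruning can shrink the search. It is not MRUCC and does not use Hadoop; the function names `is_unique` and `minimal_uniques`, the level-wise search order, and the sample rows are all assumptions made for illustration, and the row-based pruning mentioned in the abstract is not shown.

```python
from itertools import combinations


def is_unique(rows, cols):
    """Return True if no two rows agree on every column index in `cols`."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in cols)
        if key in seen:
            return False  # duplicate projection, so `cols` is not unique
        seen.add(key)
    return True


def minimal_uniques(rows, num_cols):
    """Naive level-wise search for minimal unique column combinations.

    Hypothetical helper for illustration only; the cost grows
    exponentially with the number of columns.
    """
    found = []
    for size in range(1, num_cols + 1):
        for cols in combinations(range(num_cols), size):
            # Pruning on columns: any superset of a known unique
            # combination is also unique but cannot be minimal.
            if any(set(u) <= set(cols) for u in found):
                continue
            if is_unique(rows, cols):
                found.append(cols)
    return found


# Made-up sample relation with three columns: name, city, age.
rows = [
    ("alice", "NY", 30),
    ("bob", "NY", 30),
    ("alice", "LA", 25),
]
print(minimal_uniques(rows, 3))  # prints [(0, 1), (0, 2)]
```

In this sample no single column is unique, while (name, city) and (name, age) both are, so the three-column combination is skipped without scanning the rows.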

Cite (APA)

Han, S., Cai, X., Wang, C., Zhang, H., & Wen, Y. (2014). Discovery of unique column combinations with Hadoop. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8709 LNCS, pp. 533–541). Springer Verlag. https://doi.org/10.1007/978-3-319-11116-2_49
