A unique column combination is an important kind of structural information in a relation. From a data management perspective, discovering unique column combinations is a crucial step in understanding and using the data. It benefits data modeling, data integration, anomaly detection, query optimization, and indexing. However, discovering all unique column combinations is an NP-hard problem, so efficiency is a major challenge. In this paper, we propose MRUCC, an efficient algorithm for discovering unique column combinations in large-scale data sets on Hadoop. Existing algorithms mainly target data sets of moderate size and cannot be adapted to large data sets. In contrast, MRUCC discovers unique column combinations in parallel and is implemented on Hadoop. Furthermore, we use column-based and row-based pruning to improve efficiency. Finally, we compare MRUCC with state-of-the-art approaches on both real and synthetic data sets. The experiments show that MRUCC achieves better performance. © 2014 Springer International Publishing Switzerland.
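To make the notion concrete, the following is a minimal single-machine sketch of finding minimal unique column combinations by level-wise enumeration. It is not the MRUCC algorithm and does not use Hadoop; the function names `is_unique` and `minimal_uccs` and the sample rows are hypothetical, chosen only for illustration. Skipping supersets of combinations already known to be unique loosely mirrors the idea of column-based pruning, and stopping a scan at the first duplicate loosely mirrors row-based pruning.

```python
from itertools import combinations


def is_unique(rows, cols):
    """Return True if no two rows share the same values in the given columns."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in cols)
        if key in seen:
            return False  # stop scanning rows at the first duplicate
        seen.add(key)
    return True


def minimal_uccs(rows, num_cols):
    """Enumerate column combinations by increasing size and keep the minimal unique ones.

    Illustrative sketch only: the search is exponential in the number of columns.
    """
    found = []
    for size in range(1, num_cols + 1):
        for cols in combinations(range(num_cols), size):
            # A superset of a unique combination is unique but not minimal, so skip it.
            if any(set(u) <= set(cols) for u in found):
                continue
            if is_unique(rows, cols):
                found.append(cols)
    return found


# Hypothetical sample relation: (name, city, age)
rows = [
    ("alice", "berlin", 30),
    ("bob", "berlin", 25),
    ("alice", "paris", 25),
]
print(minimal_uccs(rows, 3))  # [(0, 1), (0, 2), (1, 2)]
```

In this sample no single column is unique, while every pair of columns is, so the three-column combination is skipped as non-minimal.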
CITATION STYLE
Han, S., Cai, X., Wang, C., Zhang, H., & Wen, Y. (2014). Discovery of unique column combinations with Hadoop. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8709 LNCS, pp. 533–541). Springer Verlag. https://doi.org/10.1007/978-3-319-11116-2_49