HoloCleanX: A Multi-source Heterogeneous Data Cleaning Solution Based on Lakehouse

Abstract

Lakehouse architectures have effectively solved the storage of multi-source heterogeneous data, but existing systems offer no universal, effective solution for cleaning it. Building on the Lakehouse MHDP, this paper proposes an interactive cleaning scheme based on denial constraints (DCs) for multi-source heterogeneous data. First, we optimize HoloClean to achieve better results on small datasets, improving F1 by at least 5%, and we propose algorithms that parse various types of data and effectively reconstruct it. Second, we implement an interactive system with real-time feedback that extracts and visualizes basic metadata and lets users participate in cleaning by building DCs. Finally, the cleaned data is saved in the original data format without removing the original data. The experimental results show that our solution cleans multi-source heterogeneous data with high accuracy and is easy to use.
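To make the notion of a denial constraint concrete, the following is a minimal Python sketch, not the authors' implementation and not HoloClean's API. It assumes a DC is written as a list of predicates over a pair of tuples, and that a pair violates the DC when all predicates hold at once. The names `find_violations` and `violates`, the predicate encoding, and the sample rows are all hypothetical.

```python
from itertools import combinations
import operator

# Comparison operators allowed inside a predicate (an assumed, simplified set).
OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt, ">": operator.gt}


def violates(dc, t1, t2):
    """Return True when every predicate of the DC holds for the ordered pair (t1, t2)."""
    return all(OPS[op](t1[a], t2[b]) for a, op, b in dc)


def find_violations(rows, dc):
    """Return index pairs (i, j), i < j, whose rows jointly violate the DC."""
    out = []
    for i, j in combinations(range(len(rows)), 2):
        # A DC quantifies over ordered pairs, so check both orderings.
        if violates(dc, rows[i], rows[j]) or violates(dc, rows[j], rows[i]):
            out.append((i, j))
    return out


# Hypothetical sample data: two rows share a zip code but disagree on the city.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newark"},
    {"zip": "60601", "city": "Chicago"},
]

# DC: not (t1.zip = t2.zip and t1.city != t2.city)
same_zip_same_city = [("zip", "=", "zip"), ("city", "!=", "city")]

violations = find_violations(rows, same_zip_same_city)
print(violations)
```

The pairwise scan is quadratic in the number of rows and is meant only to illustrate the semantics. A user-built DC in an interactive system of the kind described would flag such pairs as candidates for repair.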

Cui, Q., Zheng, W., Hou, W., Sheng, M., Ren, P., Chang, W., & Li, X. Y. (2022). HoloCleanX: A Multi-source Heterogeneous Data Cleaning Solution Based on Lakehouse. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 13705 LNCS, pp. 165–176). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-20627-6_16
