GARF+: self-supervised and interpretable data cleaning with sequence generative adversarial networks

Abstract

Data cleaning is a long-standing challenge in data research. As data volumes grow exponentially, manual cleaning has become increasingly impractical. Despite substantial work on automated data cleaning, significant human effort remains essential, either to provide prior knowledge for generating rules or to label data for training models. In this paper, we study the problem of self-supervised and interpretable data cleaning, which automatically extracts interpretable data repair rules from dirty data. We propose a novel framework, Garf+, based on sequence generative adversarial networks (SeqGAN). A key objective of Garf+ is to capture data repair rules (e.g., the city “Dothan” uniquely determines that the county is “Houston”). Garf+ employs a SeqGAN, consisting of a generator G and a discriminator D, that trains G to learn the dependency relationships (e.g., given the city “Dothan” as input, G infers that the county should be “Houston”). After training, G can be used to generate data repair rules, but some of the generated rules may be incorrect, especially when learned from dirty data. To mitigate this problem, Garf+ further updates the learned relationships with another discriminator D′ to iteratively improve the quality of both rules and data. By combining logical and learning-based methods, Garf+ achieves interpretable data cleaning without requiring prior knowledge or labeled training data. Furthermore, Garf+ explores the potential of open-source large language models (LLMs) in data cleaning. Through fine-tuning, LLMs can effectively assimilate both general knowledge and domain-specific information. Garf+ integrates LLMs as a knowledge enhancement module to support the rule generation and data repair processes. Extensive experiments on real-world and synthetic datasets demonstrate the effectiveness of Garf+, including its original approach (Garf) and two variants designed for different scenarios. Garf+ outperforms state-of-the-art methods, with high precision and recall across different datasets, by learning from dirty datasets autonomously without human supervision.
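To make the notion of a data repair rule concrete, the following minimal Python sketch applies a rule of the form "city = Dothan implies county = Houston" to a small table. It is an illustration only and not the Garf+ implementation: the `RepairRule` class, the `apply_rules` function, and the sample rows are hypothetical, and the rule is written by hand here rather than generated by a SeqGAN.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepairRule:
    """Hypothetical rule: if lhs_attr == lhs_value, then rhs_attr should be rhs_value."""
    lhs_attr: str
    lhs_value: str
    rhs_attr: str
    rhs_value: str


def apply_rules(rows, rules):
    """Return repaired copies of the rows plus a log of every change made.

    Each log entry is (row_index, attribute, old_value, new_value), so every
    repair can be traced back to the rule-determined value that replaced it.
    """
    repaired, changes = [], []
    for i, row in enumerate(rows):
        new_row = dict(row)  # copy, so the input rows are left untouched
        for rule in rules:
            matches = new_row.get(rule.lhs_attr) == rule.lhs_value
            if matches and new_row.get(rule.rhs_attr) != rule.rhs_value:
                changes.append((i, rule.rhs_attr, new_row.get(rule.rhs_attr), rule.rhs_value))
                new_row[rule.rhs_attr] = rule.rhs_value
        repaired.append(new_row)
    return repaired, changes


# The rule from the abstract's example, written by hand for illustration.
rules = [RepairRule("city", "Dothan", "county", "Houston")]

# Made-up sample rows; the second one carries a hypothetical typo.
rows = [
    {"city": "Dothan", "county": "Houston"},
    {"city": "Dothan", "county": "Huston"},
    {"city": "Mobile", "county": "Mobile"},
]

repaired, changes = apply_rules(rows, rules)
for change in changes:
    print(change)
```

Because each repair is logged together with the value it replaced, a reader can audit why a cell changed, which is the kind of interpretability that rule-based cleaning aims for.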

Cite

Peng, J., Cui, H., Shen, D., Tang, N., Kou, Y., Nie, T., … Yu, G. (2025). GARF+: self-supervised and interpretable data cleaning with sequence generative adversarial networks. VLDB Journal, 34(6). https://doi.org/10.1007/s00778-025-00941-9
