TabReformer: Unsupervised Representation Learning for Erroneous Data Detection

  • Nashaat M
  • Ghosh A
  • Miller J
  • Quader S

Abstract

Error detection is a crucial preliminary phase in any data analytics pipeline. Existing error detection techniques typically target specific types of errors. Moreover, most of these detection models require either user-defined rules or ample hand-labeled training examples. Therefore, in this article, we present TabReformer, a model that learns bidirectional encoder representations for tabular data. The proposed model consists of two main phases. In the first phase, TabReformer follows an encoder architecture with multiple self-attention layers to model the dependencies between cells and capture tuple-level representations. The model also uses a Gaussian Error Linear Unit (GELU) activation function with the Masked Data Model objective to achieve a deeper probabilistic understanding. In the second phase, the model parameters are fine-tuned for the task of erroneous data detection. Because erroneous cells form the minority class, the model applies a data augmentation module to generate additional erroneous examples. The experimental evaluation considers a wide range of databases with different types of errors and distributions. The empirical results show that our solution can improve recall by 32.95% on average compared with state-of-the-art techniques while reducing the manual effort by up to 48.86%.
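The abstract names three building blocks without giving details: a GELU activation, a masked objective over table cells, and augmentation that synthesizes erroneous examples. The sketch below illustrates each idea in plain Python and is not the authors' implementation. The `[MASK]` token, the `mask_rate` parameter, the function names, and the character-swap corruption are all assumptions made for illustration. Only the GELU formula, x · Φ(x) with Φ the standard normal CDF, is the standard definition.

```python
import math
import random

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def mask_cells(row, mask_rate, rng):
    """Hide a random subset of cells in one tuple (a list of strings).

    Returns the masked row and a dict {column index: original value}.
    A masked objective would train the encoder to recover those values
    from the remaining, visible cells. The mask rate is an assumption.
    """
    masked, targets = [], {}
    for i, cell in enumerate(row):
        if rng.random() < mask_rate:
            masked.append(MASK)
            targets[i] = cell
        else:
            masked.append(cell)
    return masked, targets


def inject_typo(row, rng):
    """Build a synthetic erroneous tuple by swapping two adjacent characters.

    Returns the corrupted row and per-cell labels (1 = erroneous, 0 = clean).
    This is one illustrative corruption, not the paper's augmentation module.
    """
    # Only consider swaps that actually change the cell's text.
    swappable = [
        (i, j)
        for i, cell in enumerate(row)
        for j in range(len(cell) - 1)
        if cell[j] != cell[j + 1]
    ]
    if not swappable:
        return list(row), [0] * len(row)
    i, j = rng.choice(swappable)
    cell = row[i]
    corrupted = list(row)
    corrupted[i] = cell[:j] + cell[j + 1] + cell[j] + cell[j + 2:]
    labels = [1 if k == i else 0 for k in range(len(row))]
    return corrupted, labels


if __name__ == "__main__":
    rng = random.Random(0)
    clean = ["Toronto", "ON", "M5V 2T6"]  # made-up example tuple
    print(mask_cells(clean, mask_rate=0.3, rng=rng))
    print(inject_typo(clean, rng))
```

In this sketch the masking step needs no labels, which matches the unsupervised first phase, while the corrupted tuples and their labels stand in for the extra minority-class examples used during fine-tuning.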

Cite

Nashaat, M., Ghosh, A., Miller, J., & Quader, S. (2021). TabReformer: Unsupervised Representation Learning for Erroneous Data Detection. ACM/IMS Transactions on Data Science, 2(3), 1–29. https://doi.org/10.1145/3447541
