Detecting poor-quality data is crucial for improving the quality of data-driven systems. Although data validation has been researched extensively, potential data quality issues remain underexplored. Such latent issues, or data smells, often stay undetected and can degrade the future performance of data-intensive systems. Detecting data smells is not trivial and requires knowledge of their causes. In this paper, we present preliminary findings on the causes and severity of data smells, based on a study of a real-world business travel data set and the data processing pipeline behind it. The results show that data smells exist in this data set and cause severe problems. Although many data smells already occur in the raw data, some are created during the transformation and enrichment stages of the data processing pipeline. These findings indicate the importance of the data pipeline itself for future research on data smells. We therefore propose directions for future work in this area.
CITATION
Golendukhina, V., Foidl, H., Felderer, M., & Ramler, R. (2022). Preliminary findings on the occurrence and causes of data smells in a real-world business travel data processing pipeline. In SEA4DQ 2022 - Proceedings of the 2nd International Workshop on Software Engineering and AI for Data Quality in Cyber-Physical Systems/Internet of Things, co-located with ESEC/FSE 2022 (pp. 18–21). Association for Computing Machinery, Inc. https://doi.org/10.1145/3549037.3561275