In recent years, the amount of data has grown substantially. In companies, spreadsheets are a common tool for data processing and statistical analysis. However, spreadsheet applications reach their limits when working with massive amounts of data. To cope with this issue, we introduce a human-in-the-loop approach for scalable data preprocessing using sampling. In contrast to state-of-the-art approaches, we also consider conflict resolution and recommendations based on data not contained in the sample itself. We implemented a fully functional prototype and conducted a user study with 12 participants. We show that our approach achieves significantly higher error correction than comparable approaches that consider only the sample dataset.
Behringer, M., Hirmer, P., Fritz, M., & Mitschang, B. (2020). Empowering domain experts to preprocess massive distributed datasets. In Lecture Notes in Business Information Processing (Vol. 389 LNBIP, pp. 61–75). Springer. https://doi.org/10.1007/978-3-030-53337-3_5