Semgen—towards a semantic data generator for benchmarking duplicate detectors

Abstract

Benchmarking the quality of duplicate detection methods requires comprehensive knowledge of the duplicate pairs in a test data set, as well as test data of sufficient size and variability. Extending real-world data sets with artificially created data is promising, but current approaches to such synthetic data generation work solely on a quantitative level. As a result, duplicate semantics are represented only implicitly, and variability cannot be configured sufficiently. In this paper we propose SemGen, a semantics-driven approach to synthetic data generation. SemGen first diversifies real-world objects on a qualitative level and then, in a second step, generates quantitative values. To demonstrate the applicability of SemGen, we show how duplicate semantics can be defined for the domain of road traffic management. A discussion of lessons learned concludes the paper.
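
The two-step idea in the abstract (choose a qualitative variation first, then generate quantitative values that satisfy it) might be sketched as follows. This is only an illustrative sketch and not SemGen's actual implementation: the `Report` type, the relation names `equal`, `inside`, and `overlaps`, and the functions `pick_relation`, `realize`, and `generate` are all hypothetical, chosen to resemble traffic reports located on a stretch of road.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    """Hypothetical traffic report covering a stretch of road (in km)."""
    road: str
    start_km: float
    end_km: float


# Assumed qualitative relations between an original report and its duplicate.
RELATIONS = ("equal", "inside", "overlaps")


def pick_relation(rng):
    """Step 1 (qualitative): decide how the duplicate relates to the original."""
    return rng.choice(RELATIONS)


def realize(original, relation, rng):
    """Step 2 (quantitative): generate concrete values satisfying the relation."""
    length = original.end_km - original.start_km
    if relation == "equal":
        return Report(original.road, original.start_km, original.end_km)
    if relation == "inside":
        # Shrink both ends so the duplicate lies strictly within the original.
        start = original.start_km + rng.uniform(0.1, 0.4) * length
        end = original.end_km - rng.uniform(0.1, 0.4) * length
        return Report(original.road, start, end)
    if relation == "overlaps":
        # Shift forward by less than the full length so the extents overlap.
        shift = rng.uniform(0.1, 0.9) * length
        return Report(original.road, original.start_km + shift,
                      original.end_km + shift)
    raise ValueError(f"unknown relation: {relation}")


def generate(originals, per_object, seed=0):
    """Return the extended data set and the known duplicate pairs."""
    rng = random.Random(seed)
    data, truth = [], []
    for obj in originals:
        data.append(obj)
        for _ in range(per_object):
            relation = pick_relation(rng)
            duplicate = realize(obj, relation, rng)
            data.append(duplicate)
            truth.append((obj, duplicate, relation))
    return data, truth
```

Because the qualitative relation is stored alongside each generated pair, a benchmark built this way would know not only which records are duplicates but also what kind of variation each one exhibits, which is the explicit representation of duplicate semantics that the abstract argues purely quantitative generators lack.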

Cite

Gottesheim, W., Mitsch, S., Retschitzegger, W., Schwinger, W., & Baumgartner, N. (2011). Semgen—towards a semantic data generator for benchmarking duplicate detectors. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 6637 LNCS, pp. 490–501). Springer Verlag. https://doi.org/10.1007/978-3-642-20244-5_47
