Enhancing Synthetic Test Data Generation with Language Models Using a More Expressive Domain-Specific Language

Abstract

Generating production-like test data that complies with privacy regulations, such as the General Data Protection Regulation (GDPR), is a significant challenge in testing data-intensive software systems. In our previous research, we posed this challenge as a language modeling problem. We trained a language model to capture the statistical properties of production data and showed that it can effectively generate production-like test data. However, the richness of the generated data in our earlier work was limited by the information capacity of the domain-specific language that we used for representing the data and the training corpus. In this paper, we present an enhanced approach that uses a more expressive domain-specific language with a higher information capacity. We show that the new domain-specific language makes it possible to better leverage deep-learning technology and to generate even richer, production-like test data. Our experimental results show that, with higher information capacity and constraint complexity, the new language performs better in terms of generated data quality, with an affordable increase in computational cost.

Cite

Tan, C., Behjati, R., & Arisholm, E. (2023). Enhancing Synthetic Test Data Generation with Language Models Using a More Expressive Domain-Specific Language. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 14131 LNCS, pp. 21–39). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-43240-8_2
