Generating production-like test data that complies with privacy regulations, such as the General Data Protection Regulation (GDPR), is a significant challenge in testing data-intensive software systems. In our previous research, we posed this challenge as a language modeling problem. We trained a language model to capture the statistical properties of production data and showed that it can effectively generate production-like test data. However, the richness of the generated data in our earlier work was limited by the information capacity of the domain-specific language that we used for representing the data and the training corpus. In this paper, we present an enhanced approach that uses a more expressive domain-specific language with a higher information capacity. We show that the new domain-specific language allows us to better leverage deep-learning technology and generate even richer production-like test data. Our experimental results show that, with its higher information capacity and constraint complexity, the new language yields higher-quality generated data at an affordable increase in computational cost.
Tan, C., Behjati, R., & Arisholm, E. (2023). Enhancing Synthetic Test Data Generation with Language Models Using a More Expressive Domain-Specific Language. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 14131 LNCS, pp. 21–39). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-43240-8_2