Enhancing Synthetic Test Data Generation with Language Models Using a More Expressive Domain-Specific Language

Chao Tan; Razieh Behjati; Erik Arisholm

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2023) 14131 LNCS 21-39

DOI: 10.1007/978-3-031-43240-8_2

Abstract

Generating production-like test data that complies with privacy regulations, such as the General Data Protection Regulation (GDPR), is a significant challenge in testing data-intensive software systems. In our previous research, we posed this challenge as a language modeling problem. We trained a language model to capture the statistical properties of production data and showed that it can effectively generate production-like test data. However, the richness of the generated data in our earlier work was limited by the information capacity of the domain-specific language that we used for representing the data and the training corpus. In this paper, we present an enhanced approach that uses a more expressive domain-specific language with a higher information capacity. We show that the new domain-specific language allows us to better leverage deep-learning technology and generate even richer, production-like test data. Our experimental results show that, with higher information capacity and constraint complexity, the new language performs better in terms of generated data quality, with an affordable increase in computational cost.

Citation (APA)

Tan, C., Behjati, R., & Arisholm, E. (2023). Enhancing Synthetic Test Data Generation with Language Models Using a More Expressive Domain-Specific Language. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 14131 LNCS, pp. 21–39). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-43240-8_2
