Leyzer: A dataset for multilingual virtual assistants

Marcin Sowański; Artur Janicki

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 12284 LNAI, pp. 477–486 (2020)

DOI: 10.1007/978-3-030-58323-1_51

11 Citations

8 Readers

Abstract

In this article, we present the Leyzer dataset, a multilingual text corpus designed for studying multilingual and cross-lingual natural language understanding (NLU) models and localization strategies for virtual assistants. The corpus consists of 20 domains across three languages (English, Spanish, and Polish), with 186 intents and between 1 and 672 sentences per intent. We describe the data generation process, including the creation of grammars and forced parallelization, and present a detailed analysis of the resulting corpus. Finally, we report results for two localization strategies: train-on-target and zero-shot learning using multilingual BERT models.

Cite

Sowański, M., & Janicki, A. (2020). Leyzer: A dataset for multilingual virtual assistants. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 12284 LNAI, pp. 477–486). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-58323-1_51
