Leyzer: A dataset for multilingual virtual assistants

Abstract

In this article we present the Leyzer dataset, a multilingual text corpus designed to study multilingual and cross-lingual natural language understanding (NLU) models and localization strategies for virtual assistants. The corpus consists of 20 domains across three languages (English, Spanish and Polish), with 186 intents and between 1 and 672 sentences per intent. We describe the data generation process, including the creation of grammars and forced parallelization, and present a detailed analysis of the resulting corpus. Finally, we report results for two localization strategies: train-on-target and zero-shot learning using multilingual BERT models.
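The two localization strategies named in the abstract differ mainly in which language supplies the training data. The sketch below is illustrative only and is not the authors' code. It assumes the common reading of the terms: train-on-target fits the model on target-language data, while zero-shot fits it on a source language and evaluates on the target language without any target-language training samples. The utterances, the intent labels, the `select_splits` helper, and the choice of English as the source language are all hypothetical.

```python
from typing import Dict, List, Tuple

Sample = Tuple[str, str]  # (utterance, intent label)
Splits = Dict[str, List[Sample]]  # keys: "train", "test"

# Toy, made-up utterances and intent names; NOT taken from the Leyzer corpus.
CORPUS: Dict[str, Splits] = {
    "en": {
        "train": [("play some jazz", "PlayMusic")],
        "test": [("send an email to anna", "SendEmail")],
    },
    "pl": {
        "train": [("włącz jazz", "PlayMusic")],
        "test": [("wyślij e-mail do anny", "SendEmail")],
    },
}


def select_splits(
    corpus: Dict[str, Splits],
    strategy: str,
    target: str,
    source: str = "en",  # assumption: English acts as the source language
) -> Tuple[List[Sample], List[Sample]]:
    """Return (training samples, evaluation samples) for a strategy."""
    if strategy == "train-on-target":
        # Train on the target language itself.
        train = corpus[target]["train"]
    elif strategy == "zero-shot":
        # Train on the source language only; the target is unseen in training.
        train = corpus[source]["train"]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Both strategies are evaluated on target-language test data.
    return train, corpus[target]["test"]


zs_train, zs_test = select_splits(CORPUS, "zero-shot", target="pl")
tt_train, tt_test = select_splits(CORPUS, "train-on-target", target="pl")
```

In both cases the evaluation set is the same, so any difference in scores would reflect only the language of the training data.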

Cite

Sowański, M., & Janicki, A. (2020). Leyzer: A dataset for multilingual virtual assistants. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 12284 LNAI, pp. 477–486). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-58323-1_51
