Unsupervised data augmentation for less-resourced languages with no standardized spelling

Abstract

Non-standardized languages pose a challenge both to the construction of representative linguistic resources and to the development of efficient natural language processing tools: when spelling is not governed by a consensual norm, a given word can appear in many alternative written forms, which leads to a large proportion of out-of-vocabulary words. To embrace this diversity, we propose a methodology based on crowdsourcing alternative spellings, from which variation rules are automatically extracted. The rules are then used to match out-of-vocabulary words with one of their spelling variants. This virtuous process enables the unsupervised augmentation of multi-variant lexicons without requiring experts to define rules manually. We apply this multilingual methodology to Alsatian, a regional language of France, and provide (i) an intrinsic evaluation of the correctness of the obtained variant pairs and (ii) an extrinsic evaluation on a downstream task, part-of-speech tagging. We show that in a low-resource scenario, collecting spelling variants for only 145 words can lead to (i) the generation of 876 additional variant pairs and (ii) a reduction in out-of-vocabulary words that improves tagging performance by 1 to 4%.
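
The pipeline described in the abstract (variant pairs, then extracted rules, then matching of out-of-vocabulary words) can be illustrated with a minimal sketch. The abstract does not say how the rules are extracted or applied, so everything here is an assumption: character-level alignment with Python's difflib stands in for the authors' extraction step, the function names (extract_rules, candidates, match_oov) are hypothetical, each candidate applies a single rule at a single position, and the word strings are invented placeholders rather than Alsatian data.

```python
from difflib import SequenceMatcher


def extract_rules(pairs):
    """Collect substring rewrite rules from pairs of spelling variants.

    Assumption: a rule is any non-matching span in a character alignment
    of the two variants, recorded in both directions.
    """
    rules = set()
    for a, b in pairs:
        matcher = SequenceMatcher(None, a, b, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                rules.add((a[i1:i2], b[j1:j2]))
                rules.add((b[j1:j2], a[i1:i2]))
    return rules


def candidates(word, rules):
    """Generate spellings reachable by applying one rule at one position."""
    out = set()
    for old, new in rules:
        if not old:
            continue  # skip pure insertions: they could apply anywhere
        start = word.find(old)
        while start != -1:
            out.add(word[:start] + new + word[start + len(old):])
            start = word.find(old, start + 1)
    out.discard(word)
    return out


def match_oov(word, lexicon, rules):
    """Return known lexicon entries that are rule-derived variants of word."""
    return sorted(candidates(word, rules) & set(lexicon))


# Invented placeholder strings, not real Alsatian forms.
toy_pairs = [("kaza", "kasa")]
toy_lexicon = {"masa", "kasa"}
toy_rules = extract_rules(toy_pairs)
print(match_oov("maza", toy_lexicon, toy_rules))  # ['masa']
```

In this toy run, the crowdsourced pair yields the rule z ↔ s, which links the unseen form "maza" to the known entry "masa". A realistic system would also need to filter or rank rules, since unconstrained substitutions can produce spurious matches.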

Citation (APA)

Millour, A., & Fort, K. (2019). Unsupervised data augmentation for less-resourced languages with no standardized spelling. In International Conference Recent Advances in Natural Language Processing, RANLP (Vol. 2019-September, pp. 776–784). Incoma Ltd. https://doi.org/10.26615/978-954-452-056-4_090
