Abstract
Non-standardized languages pose a challenge both to the construction of representative linguistic resources and to the development of efficient natural language processing tools: when spelling is not governed by a consensual norm, a given word can appear in many alternative written forms, which leads to a large proportion of out-of-vocabulary words. To embrace this diversity, we propose a methodology based on crowdsourcing alternative spellings, from which variation rules are automatically extracted. The rules are then used to match out-of-vocabulary words with one of their spelling variants. This virtuous process enables the unsupervised augmentation of multi-variant lexicons without requiring experts to define rules manually. We apply this multilingual methodology to Alsatian, a French regional language, and provide (i) an intrinsic evaluation of the correctness of the obtained variant pairs and (ii) an extrinsic evaluation on a downstream task, part-of-speech tagging. We show that, in a low-resource scenario, collecting spelling variants for only 145 words can lead to (i) the generation of 876 additional variant pairs and (ii) a reduction in out-of-vocabulary words that improves tagging performance by 1 to 4%.
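The process summarized in the abstract (extract variation rules from collected spelling pairs, then use them to link out-of-vocabulary words to known variants) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the use of `difflib.SequenceMatcher` for character alignment, the restriction to substitution rules, and the toy strings are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def extract_rules(pairs):
    """Collect character-level substitution rules from spelling-variant pairs.

    Assumption: only 'replace' operations from the alignment are kept;
    insertions and deletions are ignored in this sketch.
    """
    rules = set()
    for left, right in pairs:
        matcher = SequenceMatcher(None, left, right, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "replace":
                a, b = left[i1:i2], right[j1:j2]
                # Store the rule in both directions, since neither
                # spelling is treated as the reference form.
                rules.add((a, b))
                rules.add((b, a))
    return rules


def candidates(word, rules):
    """Apply each rule once, at every position where its source occurs."""
    out = set()
    for src, dst in rules:
        start = word.find(src)
        while start != -1:
            out.add(word[:start] + dst + word[start + len(src):])
            start = word.find(src, start + 1)
    return out


def match_oov(word, lexicon, rules):
    """Return the lexicon entries reachable from an out-of-vocabulary word."""
    return sorted(candidates(word, rules) & lexicon)


# Toy, hypothetical input (not data from the paper).
rules = extract_rules([("hüs", "hus")])
matches = match_oov("mus", {"müs", "hüs"}, rules)
print(rules, matches)
```

Storing each rule in both directions reflects the absence of a reference spelling: either form of a pair may be the one already present in the lexicon.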
Cite
Millour, A., & Fort, K. (2019). Unsupervised data augmentation for less-resourced languages with no standardized spelling. In International Conference Recent Advances in Natural Language Processing, RANLP (Vol. 2019-September, pp. 776–784). Incoma Ltd. https://doi.org/10.26615/978-954-452-056-4_090