Sesame Street to Mount Sinai: BERT-constrained character-level Moses models for multilingual lexical normalization

Abstract

This paper describes the HEL-LJU submissions to the MultiLexNorm shared task on multilingual lexical normalization. Our system has two steps. First, a BERT token classification preprocessing step predicts, for each token, the type of transformation it needs (none, uppercase, lowercase, capitalize, modify). Second, a character-level statistical machine translation step translates the text from its original to its normalized form, subject to the BERT-predicted transformation constraints. For some languages, depending on the results on development data, the training data was extended by back-translating OpenSubtitles data. In the final ranking of the ten participating teams, the HEL-LJU team placed second, scoring better than the previous state of the art.
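To make the five transformation types concrete, the following is a minimal Python sketch of how such labels could be derived from aligned (original, normalized) token pairs and then applied. It is an illustration only, not the authors' implementation: the function names `transformation_label` and `apply_label`, the order in which ties are resolved, the reading of "capitalize" as "uppercase the first character only", and the `translate` callable standing in for the character-level translation step are all assumptions.

```python
from typing import Callable, Optional

# The five transformation types named in the abstract.
LABELS = ("none", "uppercase", "lowercase", "capitalize", "modify")


def capitalize_first(token: str) -> str:
    # Assumption: "capitalize" uppercases the first character and leaves the rest as is.
    return token[:1].upper() + token[1:]


def transformation_label(raw: str, norm: str) -> str:
    """Derive a transformation type for one aligned token pair.

    Assumption: when several labels fit (e.g. "i" -> "I"), the first
    matching check in this order wins.
    """
    if raw == norm:
        return "none"
    if raw.upper() == norm:
        return "uppercase"
    if raw.lower() == norm:
        return "lowercase"
    if capitalize_first(raw) == norm:
        return "capitalize"
    return "modify"


def apply_label(
    raw: str,
    label: str,
    translate: Optional[Callable[[str], str]] = None,
) -> str:
    """Apply a predicted label to a token.

    `translate` is a hypothetical stand-in for the character-level
    translation step; it is only consulted for the "modify" label.
    """
    if label == "none":
        return raw
    if label == "uppercase":
        return raw.upper()
    if label == "lowercase":
        return raw.lower()
    if label == "capitalize":
        return capitalize_first(raw)
    if label == "modify":
        return translate(raw) if translate is not None else raw
    raise ValueError(f"unknown label: {label!r}")


if __name__ == "__main__":
    # Toy lookup table used in place of a real translation model.
    toy_model = {"u": "you", "2morrow": "tomorrow"}
    pairs = [("u", "you"), ("paris", "Paris"), ("OMG", "omg"), ("ok", "ok")]
    for raw, norm in pairs:
        label = transformation_label(raw, norm)
        restored = apply_label(raw, label, lambda t: toy_model.get(t, t))
        print(raw, label, restored)
```

In this sketch only tokens labeled "modify" reach the translation stand-in; the other four labels are resolved by simple string operations. Whether the actual system handles them this way is not stated in the abstract.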

Cite

Scherrer, Y., & Ljubešić, N. (2021). Sesame Street to Mount Sinai: BERT-constrained character-level Moses models for multilingual lexical normalization. In W-NUT 2021 - 7th Workshop on Noisy User-Generated Text, Proceedings of the Conference (pp. 465–472). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.wnut-1.52
