Abstract
A major challenge for Arabic-to-English statistical machine translation (SMT) of user-generated text is the prevalence of text written in Arabizi, or Romanized Arabic. When facing such texts, a translation system trained on conventional Arabic-English data will suffer from extremely low model coverage. In addition, Arabizi is not regulated by any official standardization and is therefore highly ambiguous, which prevents rule-based approaches from achieving good translation results. In this paper, we improve Arabizi-to-English machine translation by presenting a simple but effective Arabizi-to-Arabic transliteration pipeline that does not require knowledge from experts or native Arabic speakers. We incorporate this pipeline into a phrase-based SMT system and show that translation after automatic Arabizi-to-Arabic transliteration yields results comparable to those achieved after human transliteration.
Cite
van der Wees, M., Bisazza, A., & Monz, C. (2016). A Simple but Effective Approach to Improve Arabizi-to-English Statistical Machine Translation. Proceedings of the 2nd Workshop on Noisy User-Generated Text (WNUT), 43–50. Retrieved from https://aclanthology.org/W16-3908.pdf