OdiEnCorp: Odia–English and Odia-Only Corpus for Machine Translation

Abstract

A multilingual country like India needs language corpora for its low-resource languages, not only to give its citizens the natural language processing (NLP) technologies readily available in other countries, but also to support their educational and cultural needs. In this work, we focus on one such low-resource language, Odia, and build an Odia–English parallel corpus (OdiEnCorp) and an Odia monolingual corpus (OdiMonoCorp). The parallel corpus is based on Odia–English parallel texts extracted from online resources and formally corrected by volunteers. We also preprocess the parallel corpus for machine translation research or training. The monolingual corpus comes from a diverse set of online resources, and we organize it into a collection of segments and paragraphs that NLP tools can handle easily. The OdiEnCorp parallel corpus contains 29,346 sentence pairs with 756K English and 648K Odia tokens. OdiMonoCorp contains 2.6 million tokens in 221K sentences across 71K paragraphs. Despite their small size, OdiEnCorp and OdiMonoCorp are still the largest Odia language resources, and they are freely available for noncommercial educational or research purposes.

Cite

Parida, S., Bojar, O., & Dash, S. R. (2020). OdiEnCorp: Odia–English and Odia-Only Corpus for Machine Translation. In Smart Innovation, Systems and Technologies (Vol. 159, pp. 495–504). Springer. https://doi.org/10.1007/978-981-13-9282-5_47
