Machine Translation (MT) is the automatic conversion of text from a source language into a target language while preserving the meaning of the source text. Large parallel corpora are essential resources for building any MT model. However, most languages are under-resourced: they lack the computational tools and digital resources, in particular parallel corpora, needed for MT. Further, translating under-resourced languages with complex morphological structures is more challenging. In view of these factors, this paper describes practical approaches to developing MT systems for the Kannada-Sanskrit language pair from scratch. The work comprises the construction of KanSan, a parallel corpus for the Kannada-Sanskrit language pair, and the implementation of MT baselines for translating Kannada text to Sanskrit text and vice versa. The models implemented in both translation directions are: Recurrent Neural Network (RNN), Bidirectional Recurrent Neural Network (BiRNN), transformer-based Neural Machine Translation (NMT) with and without subword tokenization, and Statistical Machine Translation (SMT). The performance of the MT models is measured by Bilingual Evaluation Understudy (BLEU) score. Among all the models, the transformer-based model with subword tokenization performed best, with BLEU scores of 9.84 for Kannada-to-Sanskrit and 12.63 for Sanskrit-to-Kannada translation.
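For readers unfamiliar with the metric, the following is a minimal sketch of how a corpus-level BLEU score on a 0 to 100 scale can be computed. It assumes whitespace tokenization, a single reference per sentence, n-grams up to length 4, and no smoothing. The abstract does not state which BLEU toolkit or settings the authors used, so the function names (`ngram_counts`, `corpus_bleu`) and the toy sentences are illustrative assumptions, not the paper's evaluation setup or KanSan data.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100), one reference per hypothesis, no smoothing.

    Assumption: sentences are plain strings tokenized on whitespace.
    """
    matches = [0] * max_n   # clipped n-gram matches, per n-gram order
    totals = [0] * max_n    # n-grams produced by the system, per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngram_counts(h, n), ngram_counts(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())

    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)


if __name__ == "__main__":
    # Toy placeholder sentences, not taken from the KanSan corpus.
    system_output = ["a b c d e"]
    reference = ["a b c d f"]
    print(f"BLEU = {corpus_bleu(system_output, reference):.2f}")
```

Scores from a sketch like this are only comparable with published figures when the tokenization, the number of references, and the smoothing match those used in the original evaluation.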
Hegde, A., & Shashirekha, H. L. (2023). KanSan: Kannada-Sanskrit Parallel Corpus Construction for Machine Translation. In Communications in Computer and Information Science (Vol. 1802 CCIS, pp. 3–18). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-33231-9_1