KanSan: Kannada-Sanskrit Parallel Corpus Construction for Machine Translation

Asha Hegde; Hosahalli Lakshmaiah Shashirekha

Conference Proceedings

Communications in Computer and Information Science (2023) 1802 CCIS 3-18

DOI: 10.1007/978-3-031-33231-9_1

Abstract

Machine Translation (MT) is the automatic conversion of text from a source language into a target language while preserving the meaning of the source text. Large parallel corpora are essential resources for building any MT model. However, most languages are under-resourced: they lack computational tools and digital resources, parallel corpora for MT in particular. Further, translating under-resourced languages with complex morphological structures is more challenging. In view of these factors, this paper describes practical approaches to developing MT systems for the Kannada-Sanskrit language pair from scratch. The work comprises the construction of KanSan, a parallel corpus for the Kannada-Sanskrit language pair, and the implementation of MT baselines for translating Kannada text to Sanskrit text and vice versa. The baseline models are: Recurrent Neural Network (RNN), Bidirectional Recurrent Neural Network (BiRNN), transformer-based Neural Machine Translation (NMT) with and without subword tokenization, and Statistical Machine Translation (SMT). The performance of the MT models is measured in terms of Bilingual Evaluation Understudy (BLEU) score. Among all the models, the transformer-based model with subword tokenization performed best, with BLEU scores of 9.84 and 12.63 for Kannada-to-Sanskrit and Sanskrit-to-Kannada MT, respectively.
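For readers unfamiliar with the metric, the following is a minimal sketch of how a corpus-level BLEU score on a 0-100 scale can be computed. It is not the authors' evaluation code: the abstract does not say which BLEU implementation, tokenization, or smoothing was used, so the function name `corpus_bleu`, the whitespace tokenization, the single reference per sentence, and the absence of smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100), one reference per hypothesis, no smoothing.

    Assumption: sentences are tokenized by whitespace; the paper's actual
    tokenization is not stated in the abstract.
    """
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any n-gram order with zero matches gives BLEU = 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    # Placeholder tokens, not real Kannada or Sanskrit data.
    print(corpus_bleu(["a b c d"], ["a b c d e"]))
```

On this scale, an exact match with the reference scores 100, so the reported scores of 9.84 and 12.63 indicate limited n-gram overlap with the reference translations, as is common for baselines on under-resourced, morphologically rich language pairs.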

Hegde, A., & Shashirekha, H. L. (2023). KanSan: Kannada-Sanskrit Parallel Corpus Construction for Machine Translation. In Communications in Computer and Information Science (Vol. 1802 CCIS, pp. 3–18). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-33231-9_1
