Parameter-Efficient Korean Character-Level Language Modeling


Abstract

Character-level language modeling has been shown empirically to perform well on highly agglutinative or morphologically rich languages while using only a small fraction of the parameters required by (sub)word models. Korean fits nicely into this framework, except that, like other CJK languages, it has a very large character vocabulary of 11,172 unique syllables. However, unlike Japanese Kanji and Chinese Hanzi, each Korean syllable can be uniquely factored into a small set of subcharacters, called jamo. We explore a "three-hot" scheme, in which we exploit the decomposability of Korean characters to model at the syllable level while using only jamo-level representations. We find that our three-hot embedding and decoding scheme alleviates the two major issues with prior syllable- and jamo-level models. Namely, it requires fewer than 1% of the embedding parameters of a syllable model, and it does not require tripling the sequence length, as jamo models do. In addition, it addresses a theoretical flaw in a prior three-hot modeling scheme. Our experiments show that, even when reducing the number of embedding parameters by more than 99.6% (from 11.4M to just 36k), our model suffers no loss in translation quality compared to the baseline syllable model.
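The factorization underlying the three-hot idea is the standard Unicode Hangul composition formula: every one of the 11,172 precomposed syllables decomposes uniquely into a lead consonant (19 choices), a vowel (21 choices), and an optional tail consonant (28 choices, including "none"), since 19 × 21 × 28 = 11,172. The sketch below shows this decomposition; it illustrates why jamo-level embedding tables can be so small (19 + 21 + 28 = 68 rows instead of 11,172) and is not the paper's implementation.

```python
# Decompose a precomposed Hangul syllable (U+AC00..U+D7A3) into its
# (lead, vowel, tail) jamo indices via the Unicode composition formula:
#   codepoint = 0xAC00 + (lead * 21 + vowel) * 28 + tail
S_BASE, L_COUNT, V_COUNT, T_COUNT = 0xAC00, 19, 21, 28

def to_jamo_indices(syllable: str) -> tuple[int, int, int]:
    s = ord(syllable) - S_BASE
    if not 0 <= s < L_COUNT * V_COUNT * T_COUNT:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    lead = s // (V_COUNT * T_COUNT)
    vowel = (s // T_COUNT) % V_COUNT
    tail = s % T_COUNT  # 0 means no tail consonant
    return lead, vowel, tail

# "한" (U+D55C) factors into lead ㅎ (18), vowel ㅏ (0), tail ㄴ (4)
print(to_jamo_indices("한"))  # → (18, 0, 4)
```

A three-hot embedding then looks up one row from each of the three small jamo tables and combines them (e.g., by concatenation or summation) to form the syllable representation, so the sequence length stays at the syllable level.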

Citation (APA)

Cognetta, M., Wolf-Sonkin, L., Moon, S., & Okazaki, N. (2023). Parameter-Efficient Korean Character-Level Language Modeling. In EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference (pp. 2342–2348). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.eacl-main.172
