Abstract
Character-level language modeling has been shown empirically to perform well on highly agglutinative or morphologically rich languages while using only a small fraction of the parameters required by (sub)word models. Korean fits nicely into this framework, except that, like other CJK languages, it has a very large character vocabulary of 11,172 unique syllables. However, unlike Japanese Kanji and Chinese Hanzi, each Korean syllable can be uniquely factored into a small set of subcharacters, called jamo. We explore a "three-hot" scheme, in which we exploit the decomposability of Korean characters to model at the syllable level while using only jamo-level representations. We find that our three-hot embedding and decoding scheme alleviates the two major issues with prior syllable- and jamo-level models: it requires fewer than 1% of the embedding parameters of a syllable model, and it does not require tripling the sequence length, as jamo models do. In addition, it addresses a theoretical flaw in a prior three-hot modeling scheme. Our experiments show that, even when reducing the number of embedding parameters by > 99.6% (from 11.4M to just 36k), our model suffers no loss in translation quality compared to the baseline syllable model.
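As a rough illustration of the decomposability the abstract relies on (not code from the paper), the sketch below uses the standard Unicode arithmetic that maps each precomposed Hangul syllable (U+AC00 to U+D7A3) to an initial, medial, and final jamo index, and then embeds a syllable from three small jamo tables instead of one 11,172-row syllable table. The additive composition and the embedding dimension of 512 are illustrative assumptions, not necessarily the exact configuration used by the authors.

```python
# Sketch: three-hot jamo factorization of Hangul syllables (illustrative assumptions,
# not the paper's exact implementation).
import numpy as np

N_INITIAL, N_MEDIAL, N_FINAL = 19, 21, 28   # 19 * 21 * 28 = 11,172 syllables
HANGUL_BASE = 0xAC00                        # first precomposed syllable, U+AC00

def decompose(syllable):
    """Return (initial, medial, final) jamo indices for one Hangul syllable."""
    idx = ord(syllable) - HANGUL_BASE
    assert 0 <= idx < N_INITIAL * N_MEDIAL * N_FINAL, "not a precomposed Hangul syllable"
    return idx // (N_MEDIAL * N_FINAL), (idx // N_FINAL) % N_MEDIAL, idx % N_FINAL

dim = 512
rng = np.random.default_rng(0)
# Three small tables with 19 + 21 + 28 = 68 rows in total, versus 11,172 rows
# for a one-hot syllable embedding table.
E_init = rng.normal(size=(N_INITIAL, dim))
E_med  = rng.normal(size=(N_MEDIAL, dim))
E_fin  = rng.normal(size=(N_FINAL, dim))

def embed(syllable):
    """Assumed additive composition: one syllable vector from three jamo embeddings."""
    i, m, f = decompose(syllable)
    return E_init[i] + E_med[m] + E_fin[f]

print(decompose("한"))    # (18, 0, 4): jamo indices for ㅎ, ㅏ, ㄴ
print(embed("글").shape)  # (512,): one vector per syllable, jamo-level parameters
```

With these assumed sizes, the jamo tables hold 68 x 512 = 34,816 parameters, on the same order as the 36k figure quoted above, while the sequence length stays at one position per syllable.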
Citation
Cognetta, M., Wolf-Sonkin, L., Moon, S., & Okazaki, N. (2023). Parameter-Efficient Korean Character-Level Language Modeling. In EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference (pp. 2342–2348). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.eacl-main.172