Shuō Wén Jiě Zì: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training


Abstract

We introduce CDBERT, a new learning paradigm that enhances the semantic understanding ability of Chinese Pretrained Language Models (PLMs) with dictionary knowledge and the structure of Chinese characters. We name the two core modules of CDBERT Shuowen and Jiezi: Shuowen refers to retrieving the most appropriate meaning of a character from Chinese dictionaries, while Jiezi refers to enhancing characters' glyph representations with structure understanding. To facilitate dictionary understanding, we propose three pre-training tasks: Masked Entry Modeling, Contrastive Learning for Synonym and Antonym, and Example Learning. We evaluate our method on both the modern Chinese understanding benchmark CLUE and the ancient Chinese benchmark CCLUE. In addition, we propose PolyMRC, a new polysemy discrimination task built on our collected dictionary of ancient Chinese. Our paradigm yields consistent improvements over previous Chinese PLMs across all tasks, and substantial gains in the few-shot setting of ancient Chinese understanding.
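The abstract names a "Contrastive Learning for Synonym and Antonym" pre-training task but does not specify its loss. As a minimal sketch, assuming a generic triplet-margin formulation over PLM embeddings (the function name, margin value, and cosine-similarity choice below are illustrative assumptions, not the paper's actual objective), such a task could look like this:

```python
# Illustrative sketch only: the paper's exact contrastive objective is not
# given in the abstract; this stands in with a triplet-margin loss that
# pulls a character's embedding toward a synonym's dictionary-entry
# embedding and pushes it away from an antonym's.
import torch
import torch.nn.functional as F

def synonym_antonym_contrastive_loss(anchor, synonym, antonym, margin=0.5):
    """Encourage sim(anchor, synonym) to exceed sim(anchor, antonym)
    by at least `margin`. All inputs: (batch, hidden) embeddings."""
    pos = F.cosine_similarity(anchor, synonym, dim=-1)  # want this high
    neg = F.cosine_similarity(anchor, antonym, dim=-1)  # want this low
    return F.relu(neg - pos + margin).mean()

# Hypothetical usage with random stand-ins for PLM entry embeddings:
anchor  = torch.randn(8, 768, requires_grad=True)
synonym = torch.randn(8, 768)
antonym = torch.randn(8, 768)
loss = synonym_antonym_contrastive_loss(anchor, synonym, antonym)
loss.backward()
print(loss.item())
```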

Citation (APA)

Wang, Y., Wang, J., Zhao, D., & Zheng, Z. (2023). Shuō Wén Jiě Zì: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 1089–1101). Association for Computational Linguistics (ACL).
