Abstract
Chinese Spelling Correction (CSC) is the task of detecting and correcting misspelled characters in Chinese text. As an important step for various downstream tasks, CSC faces two challenges. 1) Character-level errors include not only spelling errors but also missing and redundant characters, which make the input and output texts differ in length. Most CSC methods cannot handle these well because their detection-correction framework requires the input and output to have the same length. Consequently, missing and redundant errors are treated as out of scope and left to future work, even though they are widespread in Chinese industrial scenarios that depend on CSC, such as Automatic Speech Recognition (ASR) and Optical Character Recognition (OCR). 2) Most existing CSC methods focus on either the detector or the corrector and train a separate model for each, which leads to insufficient parameter sharing. To address these issues, we propose UMRSpell, a novel model that learns detection and correction jointly from a multi-task learning perspective by using a detection transmission self-attention matrix, and that flexibly handles missing, redundant, and spelling errors through re-tagging rules. Furthermore, we build a new dataset, ECMR-2023, containing five kinds of character-level errors to bring the CSC task closer to real-world applications. Experiments on both the SIGHAN benchmarks and ECMR-2023 demonstrate that UMRSpell is significantly more effective than previous representative baselines.
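The three error types named above can be illustrated with a small sketch. It is not the UMRSpell model or its re-tagging rules, which the abstract does not specify; it only aligns a noisy sentence with its correction using Python's standard `difflib` and labels each difference. The function name `classify_char_errors` and the example sentences are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def classify_char_errors(source: str, target: str):
    """Label character-level differences between a noisy source and its correction.

    Illustrative only: a plain edit-script alignment, not the paper's method.
    Returns a list of (label, source_span, target_span) tuples.
    """
    labels = []
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":
            # Characters present on both sides but different: spelling error.
            # (A replace with unequal span lengths mixes several error types.)
            labels.append(("spelling", source[i1:i2], target[j1:j2]))
        elif op == "insert":
            # Characters only in the correction: the source is missing them.
            labels.append(("missing", "", target[j1:j2]))
        elif op == "delete":
            # Characters only in the source: they are redundant.
            labels.append(("redundant", source[i1:i2], ""))
    return labels


if __name__ == "__main__":
    correct = "我喜欢吃苹果"
    for noisy in ("我喜欢吃平果", "我喜吃苹果", "我喜欢欢吃苹果"):
        print(noisy, len(noisy), len(correct), classify_char_errors(noisy, correct))
```

Only the spelling case keeps the input and output the same length; the missing and redundant cases change it, which is the situation a fixed-length detection-correction framework cannot represent directly.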
Cite
He, Z., Zhu, Y., Wang, L., & Xu, L. (2023). UMRSpell: Unifying the Detection and Correction Parts of Pre-trained Models towards Chinese Missing, Redundant, and Spelling Correction. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 10238–10250). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.acl-long.570