Pretraining large language models has led to substantial performance improvements on many natural language processing (NLP) tasks. While such models can be used directly for non-domain-specific tasks, a common strategy for achieving better performance in a specific domain is to further pretrain them on domain-specific data using objectives such as Masked Language Modelling (MLM) or autoregressive language modelling. Although this pretraining adapts the model to the domain's vocabulary and style of language, it is otherwise a domain-agnostic approach. In this work, we investigate the effect of adding pretraining objectives that explicitly exploit domain-specific language characteristics alongside such MLM-based pretraining. In particular, we examine two distinct characteristics of the legal domain and propose pretraining objectives that model them. The proposed objectives target improved token-level feature representations and also aim to incorporate sentence-level semantics. We demonstrate that models pretrained with our objectives outperform models pretrained with domain-agnostic objectives on several legal downstream tasks.
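For context, the sketch below illustrates the generic, domain-agnostic MLM continued-pretraining baseline that the abstract refers to, not the paper's proposed legal-specific objectives. It assumes a Hugging Face Transformers setup; the model name, the corpus file `legal_corpus.txt`, and all hyperparameters are placeholders chosen for illustration.

```python
# Minimal sketch of domain-adaptive (continued) MLM pretraining with
# Hugging Face Transformers. This is the domain-agnostic baseline the
# abstract describes, NOT the paper's proposed objectives. Corpus path,
# base model, and hyperparameters are assumptions for illustration only.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-uncased"  # placeholder base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Hypothetical legal-domain corpus: one document per line in a text file.
dataset = load_dataset("text", data_files={"train": "legal_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking: 15% of tokens are masked for the MLM objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="legal-mlm",
        per_device_train_batch_size=8,
        num_train_epochs=1,
        learning_rate=5e-5,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```

The paper's contribution is to add further loss terms on top of this MLM objective that model legal-domain language characteristics at the token and sentence level; those objectives are described in the paper itself rather than in this sketch.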
CITATION STYLE
Nair, I., & Modani, N. (2023). Exploiting Language Characteristics for Legal Domain-Specific Language Model Pretraining. In EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Findings of EACL 2023 (pp. 2471–2481). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.findings-eacl.190