Abstract
Pre-trained language models (LMs) have led to significant performance gains in various natural language processing (NLP) applications thanks to their strong literacy, e.g., the ability to capture word dependencies. However, existing pre-trained LMs largely ignore numeracy: they treat numbers in text as plain words, without understanding basic numerical concepts. This weak numeracy has become a barrier to applying pre-trained LMs to NLP tasks over number-intensive financial documents such as annual filings and analyst reports, even as the understanding and analysis of such documents grows increasingly important. To bridge this gap, this work explores the central theme of numerical pre-training to endow LMs with numeracy. In particular, we propose two numerical pre-training methods whose objectives encourage the LM to understand the magnitude and value of numbers and to encode the dependency between a number and its context. Applying the proposed methods to BERT, we pre-train two LMs, named BERT-M and BERT-V. Moreover, we construct four datasets of financial documents for evaluating the numeracy of pre-trained LMs, focusing on three fundamental perspectives of numeracy: a) number embedding; b) number-text composition; and c) number-number composition. Extensive experiments on these datasets validate the effectiveness of BERT-M and BERT-V, which outperform the state-of-the-art LM for financial documents (FinBERT) by 4.83% and 4.34% on average, respectively. Furthermore, their aggregation, named BERT-MV, increases the gain to 10.88%.
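As a loose illustration of the magnitude-oriented objective the abstract describes, the sketch below derives order-of-magnitude bucket labels for numbers found in a financial sentence; a classification head on the LM could then be trained to predict these buckets for masked numbers. This is a hypothetical sketch under our own assumptions — the helper names (`magnitude_label`, `number_targets`) and the bucketing scheme are illustrative, not taken from the paper.

```python
import math
import re

def magnitude_label(value: float, n_buckets: int = 10) -> int:
    """Map a number to an order-of-magnitude bucket, clamped to [0, n_buckets - 1].

    Illustrative scheme (not the paper's): bucket = floor(log10(|value|)) + 1,
    so values below 1 fall near bucket 0 and larger values climb the scale.
    """
    if value == 0:
        return 0
    exponent = math.floor(math.log10(abs(value)))
    return max(0, min(n_buckets - 1, exponent + 1))

def number_targets(text: str):
    """Extract (number, magnitude-bucket) pairs from a sentence.

    These pairs could serve as auxiliary pre-training targets: mask each
    number in the input and ask the LM to predict its magnitude bucket.
    """
    numbers = [float(m) for m in re.findall(r"-?\d+(?:\.\d+)?", text)]
    return [(v, magnitude_label(v)) for v in numbers]

print(number_targets("Revenue grew 12.5 percent to 4300 million dollars."))
# → [(12.5, 2), (4300.0, 4)]
```

In this toy scheme, predicting the bucket rather than the exact value forces the model to encode coarse magnitude, which is one plausible reading of what the BERT-M objective targets.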
Citation
Feng, F., Rui, X., Wang, W., Cao, Y., & Chua, T. S. (2021). Pre-training and evaluation of numeracy-oriented language model. In ICAIF 2021 - 2nd ACM International Conference on AI in Finance. Association for Computing Machinery, Inc. https://doi.org/10.1145/3490354.3494412