BAHP: Benchmark of Assessing Word Embeddings in Historical Portuguese

Zuoyu Tian; Dylan Jarrett; Juan Manuel Escalona Torres; Patrícia Amaral

Conference Proceedings

5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, LaTeCHCLfL 2021 - Co-located with the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021 - Proceedings (2021) 113-119

DOI: 10.18653/v1/2021.latechclfl-1.13

Abstract

High-quality distributional models can capture lexical and semantic relations between words, and researchers have designed various intrinsic tasks to test whether such relations are captured. However, most intrinsic tasks are designed for modern languages, and evaluation methods for distributional models of historical corpora are lacking. In this paper, we constructed BAHP, a benchmark for assessing word embeddings in Historical Portuguese, which contains four types of tests: analogy, similarity, outlier detection, and coherence. We evaluated word2vec models trained on two historical Portuguese corpora on these four test sets. The results demonstrate that our test sets are capable of measuring the quality of vector space models and can provide a holistic view of a model’s ability to capture syntactic and semantic information. Furthermore, the methodology for creating our test sets can be easily extended to other historical languages.
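
As an illustration of the first test type, the sketch below scores an analogy question of the form "a is to b as c is to ?" by vector offset and cosine similarity. The abstract does not state which scoring method BAHP uses, so the offset approach here is an assumption, chosen only because it is a common way to run analogy tests. The function names, the toy vocabulary, and the three-dimensional vectors are all hypothetical; they are not taken from the benchmark or from the authors' word2vec models.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector-offset method.

    The target point is b - a + c; the answer is the vocabulary word
    (other than a, b, c) whose vector is closest to it by cosine similarity.
    """
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(
        solve_analogy(vectors, a, b, c) == expected
        for a, b, c, expected in questions
    )
    return correct / len(questions)


# Hypothetical toy embeddings, hand-built for illustration only.
TOY_VECTORS = {
    "rei": np.array([1.0, 1.0, 0.0]),
    "rainha": np.array([1.0, 0.0, 1.0]),
    "homem": np.array([0.0, 1.0, 0.0]),
    "mulher": np.array([0.0, 0.0, 1.0]),
    "casa": np.array([0.5, 0.5, 0.5]),
}

if __name__ == "__main__":
    questions = [("homem", "mulher", "rei", "rainha")]
    print(solve_analogy(TOY_VECTORS, "homem", "mulher", "rei"))
    print(analogy_accuracy(TOY_VECTORS, questions))
```

With real embeddings, the dictionary would be replaced by the trained model's vocabulary and vectors, and the question list by the benchmark's analogy test set.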

Cite

Tian, Z., Jarrett, D., Torres, J. M. E., & Amaral, P. (2021). BAHP: Benchmark of Assessing Word Embeddings in Historical Portuguese. In 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, LaTeCHCLfL 2021 - Co-located with the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021 - Proceedings (pp. 113–119). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.latechclfl-1.13
