Clustering-based article identification in historical newspapers

Martin Riedl; Daniela Betz; Sebastian Pado

Conference Proceedings (Open Access)

LaTeCH@NAACL-HLT 2019 - 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, Proceedings (2019) 12-17

DOI: 10.18653/v1/w19-2502

Abstract

This article focuses on the problem of identifying articles and recovering their text from within and across newspaper pages when OCR delivers only one text file per page. We frame the task as a segmentation step followed by a clustering step. Our results on a sample of the 1912 New York Tribune magazine show that clustering based on similarities computed with word embeddings outperforms clustering based on a similarity measure over character n-grams and words. Furthermore, automatic segmentation based on the text yields low scores, owing to the low quality of some OCRed documents.
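
The segmentation-plus-clustering framing can be illustrated with a minimal sketch. It is not the authors' implementation: the toy two-dimensional vectors in `TOY_VECTORS`, the averaging in `segment_vector`, the greedy single-link grouping in `cluster_segments`, and the `threshold` value are all assumptions made for illustration, and the text is assumed to be already segmented. It only shows how embedding-based similarity could group segments that belong to the same article.

```python
import math

# Hypothetical 2-d "embeddings" for illustration only; a real system would
# load vectors from a trained word-embedding model.
TOY_VECTORS = {
    "ship": (1.0, 0.1),
    "harbor": (0.9, 0.2),
    "sailors": (0.8, 0.0),
    "recipe": (0.1, 1.0),
    "butter": (0.0, 0.9),
    "oven": (0.2, 0.8),
}


def segment_vector(text):
    """Average the vectors of known words; unknown tokens (e.g. OCR noise) are skipped."""
    vecs = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vecs:
        return (0.0, 0.0)
    return tuple(sum(component) / len(vecs) for component in zip(*vecs))


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.hypot(*u) * math.hypot(*v)
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def cluster_segments(segments, threshold=0.8):
    """Greedy single-link clustering (an assumed, simplified scheme).

    Each segment joins the first existing cluster that contains a segment
    whose similarity to it reaches the threshold; otherwise it starts a
    new cluster. Returns clusters as lists of segment indices.
    """
    vectors = [segment_vector(s) for s in segments]
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if any(cosine(vec, vectors[j]) >= threshold for j in cluster):
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


if __name__ == "__main__":
    # Four already-segmented text blocks; "xq" stands in for an OCR error.
    blocks = ["ship harbor", "recipe butter", "sailors ship xq", "oven butter"]
    print(cluster_segments(blocks))
```

In this toy setup, blocks about the same topic end up in the same cluster even when they share no words, which is the kind of behavior that motivates embedding-based similarity over purely lexical measures.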

Riedl, M., Betz, D., & Pado, S. (2019). Clustering-based article identification in historical newspapers. In LaTeCH@NAACL-HLT 2019 - 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, Proceedings (pp. 12–17). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w19-2502
