Automated processing of digitized historical newspapers: Identification of segments and genres

Robert B. Allen; Ilya Waldstein; Weizhong Zhu

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 5362 LNCS, pp. 379–386 (2008)

DOI: 10.1007/978-3-540-89533-6_49

Abstract

Many historical newspapers are being digitized. We aim to support access to them via text analysis of the OCR'd content. However, the OCR output includes many errors, so extracting meaningful content from it is difficult. A pipeline of processing steps is proposed. Here, we describe the first two steps: segmentation and genre identification. The segmentation procedure based on headings was quite successful. Genre identification worked well for easily defined genre categories such as weather reports. We also propose additional techniques which may improve the accuracy still further. © 2008 Springer Berlin Heidelberg.

Cite

APA

Allen, R. B., Waldstein, I., & Zhu, W. (2008). Automated processing of digitized historical newspapers: Identification of segments and genres. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5362 LNCS, pp. 379–386). Springer Verlag. https://doi.org/10.1007/978-3-540-89533-6_49
