Many historical newspapers are being digitized. We aim to support access to them through text analysis of the OCR'd content. However, the OCR output contains many errors, so extracting meaningful content from it is difficult. We propose a pipeline of processing steps and describe the first two here: segmentation and genre identification. The segmentation procedure based on headings was quite successful. Genre identification worked well for easily defined genre categories such as weather reports. We also propose additional techniques that may improve the accuracy still further. © 2008 Springer Berlin Heidelberg.
CITATION STYLE
Allen, R. B., Waldstein, I., & Zhu, W. (2008). Automated processing of digitized historical newspapers: Identification of segments and genres. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5362 LNCS, pp. 379–386). Springer Verlag. https://doi.org/10.1007/978-3-540-89533-6_49