Efficient search in hidden text of large DjVu documents

Janusz S. Bień

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2011), vol. 6699 LNCS, pp. 1–14

DOI: 10.1007/978-3-642-23160-5_1

Abstract

The paper describes an open-source tool that makes it possible to present end users with the results of advanced language technologies. It relies on the DjVu format, which for some applications is still superior to other modern formats, including PDF/A. The GPL-licensed DjVu tools are not limited to the DjVuLibre library but are being supplemented by various new programs, such as pdf2djvu, developed by Jakub Wilk. In particular, pdf2djvu can convert the PDF output of popular OCR programs such as FineReader to DjVu while preserving the hidden text layer and some other features. The tool in question was conceived by the present author and consists of a modification of the Poliqarp corpus query tool, used for the National Corpus of Polish; his ideas have been very successfully implemented by Jakub Wilk. The new system, called here simply Poliqarp for DjVu, inherits from its origin not only the powerful search facilities based on two-level regular expressions but also the ability to represent low-level ambiguities and other linguistic phenomena. Although at present the tool is used mainly to facilitate access to the results of dirty OCR, it is also ready to handle more sophisticated output of linguistic technologies. © 2011 Springer-Verlag.

Citation (APA)

Bień, J. S. (2011). Efficient search in hidden text of large DjVu documents. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 6699 LNCS, pp. 1–14). https://doi.org/10.1007/978-3-642-23160-5_1
