We report on a probabilistic weighting approach to indexing the scanned images of very short documents. This fully automatic process copes with short and very noisy texts (67% word accuracy) derived from the images by Optical Character Recognition (OCR). The probabilistic term weighting approach is based on a theoretical proof explaining how retrieval effectiveness is affected by recognition errors. We evaluated the approach on a sample of index cards from an alphabetic library catalogue in which a card contains, on average, only 23 terms. We demonstrate an improvement of over 30% in retrieval effectiveness over a conventional weighted retrieval method that does not take recognition errors into account. We also show how to take advantage of the ordering information of the alphabetic library catalogue.
Mittendorf, E., Schäuble, P., & Sheridan, P. (1995). Applying probabilistic term weighting to OCR text in the case of a large alphabetic library catalogue. In SIGIR Forum (ACM Special Interest Group on Information Retrieval) (pp. 328–335). ACM. https://doi.org/10.1145/215206.215379