Modeling content identification from document images

9Citations
Citations of this article
72Readers
Mendeley users who have this article in their library.
Get full text

Abstract

A new technique to locate content-representing words for a given document image using abstract representation of character shapes is described. A character shape code representation defined by the location of a character in a text line has been developed. Character shape code generation avoids the computational expense of conventional optical character recognition (OCR). Because character shape codes are an abstraction of standard character code (e.g., ASCII), the mapping is ambiguous. In this paper, the ambiguity is shown to be practically limited to an acceptable level. It is illustrated that: first, punctuation marks are clearly distinguished from the other characters; second, stop words are generally distinguishable from other words, because the permutations of character shape codes in function words are characteristically different from those in content words; and third, numerals and acronyms in capital letters are distinguishable from other words. With these clAssifications, potential content-representing words are identified, and an analysis of their distribution yields their rank. Consequently, introducing character shape codes makes it possible to inexpensively and robustly bridge the gap between electronic documents and hard-copy documents for the purpose of content identification.

Cite

CITATION STYLE

APA

Nakayama, T. (1994). Modeling content identification from document images. In 4th Conference on Applied Natural Language Processing, ANLP 1994 - Proceedings (pp. 22–27). Association for Computational Linguistics (ACL). https://doi.org/10.3115/974358.974364

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free