Abstract
A new technique for locating content-representing words in a given document image using an abstract representation of character shapes is described. A character shape code representation, defined by the location of a character in a text line, has been developed. Character shape code generation avoids the computational expense of conventional optical character recognition (OCR). Because character shape codes are an abstraction of standard character codes (e.g., ASCII), the mapping is ambiguous. In this paper, the ambiguity is shown to be limited in practice to an acceptable level. It is illustrated that: first, punctuation marks are clearly distinguished from the other characters; second, stop words (function words) are generally distinguishable from other words, because the permutations of character shape codes in function words are characteristically different from those in content words; and third, numerals and acronyms in capital letters are distinguishable from other words. With these classifications, potential content-representing words are identified, and an analysis of their distribution yields their rank. Consequently, introducing character shape codes makes it possible to inexpensively and robustly bridge the gap between electronic documents and hard-copy documents for the purpose of content identification.
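The idea of an ambiguous shape-code mapping can be illustrated with a minimal Python sketch. The abstract does not list the paper's actual code set, so the classes below (ascender `A`, x-height `x`, descender `g`, plus `i` and `j`) and the names `shape_code`, `word_shape`, and `group_by_shape` are assumptions made purely for illustration. The sketch works on plain strings rather than on images; in the described technique the classes would be derived from where each character sits in the text line.

```python
from collections import defaultdict

# Hypothetical shape classes (assumed for illustration; not the paper's code set).
ASCENDERS = set("bdfhklt") | set("ABCDEFGHIJKLMNOPQRSTUVWXYZ") | set("0123456789")
DESCENDERS = set("gpqy")
X_HEIGHT = set("acemnorsuvwxz")


def shape_code(ch):
    """Map one character to an assumed shape class."""
    if ch in ASCENDERS:
        return "A"  # rises above the x-height line
    if ch in DESCENDERS:
        return "g"  # drops below the baseline
    if ch in ("i", "j"):
        return ch   # dotted characters kept as their own classes
    if ch in X_HEIGHT:
        return "x"  # fits between baseline and x-height line
    return ch       # punctuation and anything else passes through unchanged


def word_shape(word):
    """Concatenate the shape codes of every character in a word."""
    return "".join(shape_code(c) for c in word)


def group_by_shape(words):
    """Group words that collapse to the same shape-code string."""
    groups = defaultdict(list)
    for w in words:
        groups[word_shape(w)].append(w)
    return dict(groups)
```

Under these assumed classes, `cat` and `eat` both collapse to `xxA`, which shows the ambiguity the abstract mentions, while an all-capital acronym such as `OCR` or a numeral such as `1994` maps to a string made only of `A`, and punctuation keeps its own symbol.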
Cite
Nakayama, T. (1994). Modeling content identification from document images. In 4th Conference on Applied Natural Language Processing, ANLP 1994 - Proceedings (pp. 22–27). Association for Computational Linguistics (ACL). https://doi.org/10.3115/974358.974364