Learning Segmentation of Documents with Complex Scripts

  • Sesh Kumar K
  • Namboodiri A
  • Jawahar C

Abstract

Most state-of-the-art segmentation algorithms are designed to handle complex document layouts and backgrounds while assuming a simple script structure, such as that of Roman script. They perform poorly on Indian languages, where the components are not strictly collinear. In this paper, we propose a document segmentation algorithm that can handle the complexity of Indian scripts in large document image collections. Segmentation is posed as a graph cut problem that incorporates a priori information about script structure in the objective function of the cut. We show that this information can be learned automatically and adapted within a collection of documents (a book) and across collections to achieve accurate segmentation. We show results on Indian language documents in Telugu script. The approach is also applicable to other languages with complex scripts such as Bangla, Kannada, Malayalam, and Urdu.
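To make the graph cut formulation concrete, here is a minimal sketch in Python. It is not the authors' method: the abstract does not give their objective function, so the Gaussian affinity, the parameters `sigma_x` and `sigma_y`, the seed-based two-line setup, and the toy centroids are all assumptions made for illustration. Each connected component is a node, edge weights encode a script prior (components are tied more loosely along the vertical axis than along the horizontal one), and a minimum s-t cut splits the components into two text lines.

```python
import math
from collections import deque

def affinity(a, b, sigma_x=30.0, sigma_y=8.0):
    """Edge weight between two component centroids (x, y).

    sigma_x and sigma_y are hypothetical stand-ins for script-dependent
    parameters that would be learned from a document collection.
    """
    dx, dy = a[0] - b[0], a[1] - b[1]
    return math.exp(-(dx / sigma_x) ** 2 - (dy / sigma_y) ** 2)

def min_cut(capacity, source, sink):
    """Edmonds-Karp max flow; returns the nodes on the source side of a min cut."""
    n = len(capacity)
    residual = [row[:] for row in capacity]
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = [-1] * n
        parent[source] = source
        queue = deque([source])
        while queue and parent[sink] == -1:
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and residual[u][v] > 1e-12:
                    parent[v] = u
                    queue.append(v)
        if parent[sink] == -1:
            # No path left: everything still reachable is on the source side.
            return {v for v in range(n) if parent[v] != -1}
        # Find the bottleneck along the path, then push that much flow.
        flow = float("inf")
        v = sink
        while v != source:
            flow = min(flow, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while v != source:
            residual[parent[v]][v] -= flow
            residual[v][parent[v]] += flow
            v = parent[v]

def segment_two_lines(centroids, seed_top, seed_bottom):
    """Split components into two lines, given one seed component per line."""
    n = len(centroids)
    capacity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                capacity[i][j] = affinity(centroids[i], centroids[j])
    top = min_cut(capacity, seed_top, seed_bottom)
    return top, set(range(n)) - top

# Toy layout (assumed): indices 0-2 form the upper line, 4-6 the lower line,
# and index 3 is a modifier sitting below the upper baseline (non-collinear).
centroids = [(10, 10), (40, 10), (70, 10), (40, 20),
             (10, 40), (40, 40), (70, 40)]
top, bottom = segment_two_lines(centroids, seed_top=0, seed_bottom=6)
print(sorted(top), sorted(bottom))
```

With these toy values the non-collinear modifier (index 3) lands on the same side of the cut as the upper line, because its affinity to the base character above it far outweighs its affinity to the lower line. Changing `sigma_y` changes how far off the baseline a component may sit before the cut reassigns it, which is the kind of script-specific information the abstract says can be learned.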

Cite

Sesh Kumar, K. S., Namboodiri, A. M., & Jawahar, C. V. (2006). Learning Segmentation of Documents with Complex Scripts (pp. 749–760). https://doi.org/10.1007/11949619_67
