CMSOF: a structured data organization framework for scanned Chinese medicine books in digital libraries
- ISSN: 18691951
- DOI: 10.1631/jzus.C1001007
Abstract
Organizing unstructured information from books into a well-defined structure is a significant challenge in digital libraries. Most digital libraries provide search services only at the granularity of books, and few allow books to be accessed at the granularity of chapters, because manually constructing directory information for books is time-consuming. Extracting structured data from scanned books thus remains an urgent and important task. In this paper, we propose a novel structured data organization framework called CMSOF to organize scanned data automatically, and apply it to a Chinese medicine digital library. In the framework, image blocks and text blocks on each scanned page are first separated using either the gray histogram projection method or a hybrid method of region growth and the Ada-Boosting classifier; the text structure is then obtained from the text blocks by text size and font type recognition. Finally, image blocks and structured OCRed text are correlated at the semantic level. By integrating the structured data into a Chinese medicine information system (CMIS), we can organize the Chinese medicine books well and users can access them flexibly, which indicates that CMSOF is an efficient framework for organizing books that mix images and text.
CMSOF: a structured data organization framework for
scanned Chinese medicine books in digital libraries*
Jie YUAN, Bao-gang WEI†‡, Li-dong WANG, Wei-ming LU, Yue-ting ZHUANG
(School of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China)
†E-mail: wbg@zju.edu.cn
Received Sept. 14, 2010; Revision accepted Sept. 26, 2010; Crosschecked Sept. 14, 2010
Key words: Digital library, Chinese medicine, Structured data organization, Cross media, Image separation
doi:10.1631/jzus.C1001007 Document code: A CLC number: TP391.4
1 Introduction
With the development of digital libraries, more and more books
have been scanned and preserved in them. Books today are more
vivid than before, since they contain not only text but also
images (Fig. 1). In addition, the text in books often has
structure, which can be used to provide more fine-grained
services. However, few digital libraries use this information to
enrich their services. Large digital libraries such as Google Book
(Google Book Search, http://books.google.com) and CADAL
(China-America Digital Academic Library,
http://www.cadal.zju.edu.cn) have scanned numerous books, but
they provide only metadata-based search services at the
granularity of books.
To make full use of the books, several challenges must be
addressed: (1) Books are scanned in image format and their pages
mix text and images; it is difficult to separate text blocks from
image blocks because layouts and image types vary significantly
across books. (2) Text structure information, which is very
important for organizing information, is discarded by almost all
Chinese optical character recognition (OCR) systems. (3) OCR
errors in books make it difficult to integrate newly scanned
information into an existing information system. For example, we
have scanned many Chinese medicine books containing text and
images that could be used to enhance the previously built Chinese
medicine information system (CMIS); however, due to OCR errors,
items about the same medicine in CMIS and in the newly scanned
books did not match.
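To make challenge (3) concrete, the following minimal sketch shows how approximate string matching can pair an OCRed medicine name with an existing entry even when one character was misrecognized. It is an illustration only, not the integration method of CMSOF: the function `best_match`, the threshold of 0.6, and the sample names are assumptions introduced here, and the similarity measure is the one provided by Python's standard `difflib` module.

```python
from difflib import SequenceMatcher


def best_match(ocr_name, cmis_names, threshold=0.6):
    """Return the known entry most similar to an OCRed name, or None.

    `threshold` is a hypothetical cut-off chosen for this example.
    """
    best, best_score = None, 0.0
    for name in cmis_names:
        # ratio() is 2*M/T, where M counts matched characters and
        # T is the total length of both strings.
        score = SequenceMatcher(None, ocr_name, name).ratio()
        if score > best_score:
            best, best_score = name, score
    return best if best_score >= threshold else None


# Illustrative entries standing in for medicine names stored in CMIS.
cmis_names = ["人参", "当归", "黄芪", "金银花"]

# "金很花" imitates an OCR error in "金银花" (one wrong character).
print(best_match("金很花", cmis_names))  # 金银花
print(best_match("甘草", cmis_names))    # None (no similar entry)
```

An exact comparison would reject "金很花" outright, whereas the similarity score of about 0.67 still links it to the intended entry. The threshold trades missed matches against false ones and would need tuning on real data.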
Journal of Zhejiang University-SCIENCE C (Computers & Electronics)
ISSN 1869-1951 (Print); ISSN 1869-196X (Online)
www.zju.edu.cn/jzus; www.springerlink.com
E-mail: jzus@zju.edu.cn
‡ Corresponding author
* Project supported by the China Academic Digital Associative Library
(CADAL)
© Zhejiang University and Springer-Verlag Berlin Heidelberg 2010
There have been some studies on document structure and layout
analysis in recent years (Namboodiri and Jain, 2007; Lu et al.,
2008). The automatic generation of article metadata from format
and text features extracted from OCRed text has also been studied
(Lu et al., 2008). However, these studies target documents that
generally have a uniform format, and they do not take images into
account, so their methods do not readily generalize to Chinese
books or to our application.
Chinese medicine is a valuable cultural heritage of China, and we
have scanned many related books in CADAL. In this paper, we
propose a novel structured data organization framework called
CMSOF for organizing the scanned Chinese medicine books. First,
text blocks and image blocks are separated, and the text blocks
are then analyzed to obtain structured information. Next, the text
and images on the same or adjacent pages are semantically
correlated. Finally, the extracted information is integrated into
CMIS to provide fine-grained services.
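As a rough illustration of the text analysis step, the sketch below assigns a heading level to each recognized line from its text size alone. It assumes that the most frequent size is body text and that larger sizes mark higher-level headings; these rules, the function `assign_levels`, and the sample lines are hypothetical simplifications introduced here, and font type is ignored.

```python
from collections import Counter


def assign_levels(lines):
    """Map (text, font_size) pairs to (level, text) pairs.

    Level 0 is body text; level 1 is the largest heading size,
    level 2 the next largest, and so on. Assumes the most frequent
    size on the page belongs to body text.
    """
    sizes = Counter(size for _, size in lines)
    body_size = sizes.most_common(1)[0][0]
    heading_sizes = sorted((s for s in sizes if s > body_size), reverse=True)
    level_of = {size: i + 1 for i, size in enumerate(heading_sizes)}
    return [(level_of.get(size, 0), text) for text, size in lines]


# Hypothetical recognized lines with their estimated text sizes.
lines = [
    ("Chapter 1 Herbs", 18),
    ("1.1 Ginseng", 14),
    ("Ginseng is a root ...", 10),
    ("It is used ...", 10),
    ("1.2 Angelica", 14),
    ("Angelica is a root ...", 10),
]

for level, text in assign_levels(lines):
    print(level, text)
```

With these sample lines the chapter title receives level 1, the two section titles level 2, and the remaining lines level 0, which is enough to build a chapter-level directory.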
The primary contributions of the paper are
summarized as follows:
1. We propose a framework called CMSOF to
organize data in scanned books. CMSOF contains
layout analysis with image/text separation, text
structure extraction, image/text semantic relationship
construction, and data integration with CMIS. It can
organize structured cross-media data automatically.
2. We present two image separation methods, the gray histogram
projection method and a hybrid method of region growth and the
Ada-Boosting algorithm, to separate image blocks from text blocks
in a scanned page.
3. We present a method to extract the text structure of scanned
books by text size and font style recognition, so that the text is
organized into a well-defined structure.
4. We propose an integration sub-framework that combines the
extracted text and image data with the previously established
CMIS. The sub-framework can correct some OCR errors, and it gives
administrators a flexible operation mechanism.
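The idea behind projection-based separation in contribution 2 can be sketched as follows. This is a generic horizontal projection profile written under our own assumptions, not the exact gray histogram projection procedure of CMSOF: the function `horizontal_bands`, the ink threshold of 128, and the toy page are introduced here for illustration only.

```python
def horizontal_bands(page, ink_threshold=128, min_ink=1):
    """Split a grayscale page into horizontal bands that contain ink.

    `page` is a list of rows of 0-255 gray values. Returns a list of
    (top, bottom) row indices, with bottom exclusive.
    """
    # Projection profile: number of dark pixels in each row.
    profile = [sum(1 for px in row if px < ink_threshold) for row in page]
    bands, start = [], None
    for y, count in enumerate(profile):
        if count >= min_ink and start is None:
            start = y                    # a band begins
        elif count < min_ink and start is not None:
            bands.append((start, y))     # a blank row ends the band
            start = None
    if start is not None:
        bands.append((start, len(profile)))
    return bands


W, B = 255, 0  # white and black pixels of a toy 8x4 page
page = [
    [W, W, W, W],
    [B, W, B, W],
    [B, B, W, W],
    [W, W, W, W],
    [W, W, W, W],
    [B, B, B, B],
    [B, B, B, B],
    [B, B, B, B],
]
print(horizontal_bands(page))  # [(1, 3), (5, 8)]
```

Blank rows in the profile separate the page into candidate blocks. A further test, for example on band height or ink density, would be needed to label each block as text or image; that step is not shown here.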
2 Related works
There have been some studies on document layout analysis, data
structure mining, and other data mining work on scanned books.
le Bourgeois et al. (2004) introduced some general problems of
digital libraries and divided them into two classes: common
problems and particular problems. Common problems include loss of
image detail owing to the storage format, image post-processing,
automatic metadata extraction, and so on. Particular problems
arise only in specific applications, such as the digital
processing of 18th century European manuscripts. Gatos et al.
(2005) proposed a technique for automatic table detection in
document images. After pre-processing the document images, they
used mainly morphological operations and threshold filtering to
detect table lines. Namboodiri and Jain (2007) extracted document
structure and analyzed layout from documents with a complex
layout. Lu et al. (2008) proposed a supervised learning based
method to generate description and structure metadata of digital
books. The OCRed text was stored in DjVu XML files. This format
contains not only plain OCRed text, but also the logical structure
of the text and the bounding rectangle of every word. Chinese OCR
systems, however, do not yet provide these functions. Liu et al.
(2010) proposed a semi-supervised learning method for detecting
text-lines in noisy document images. They used the seed filling
algorithm for initial segmentation, then projection profiles for
estimating the vertical border of page contents, and finally a
classifier for removing speckle noise embedded inside the content
zones. The above methods mainly analyze document layout without
taking images into account, while in
Fig. 1 A scanned book page with mixed text and images (labeled regions: structured text and images)



