Automatic extraction of web data records containing user-generated content
- ISBN: 9781450300995
- DOI: 10.1145/1871437.1871447
Abstract
In this paper, we are concerned with the problem of automatically extracting web data records that contain user-generated content (UGC). In previous work, web data records are usually assumed to be well-formed with a limited amount of UGC, and thus can be extracted by testing repetitive structure similarity. However, when a web data record includes a large portion of free-format UGC, the similarity test between records may fail, which in turn lowers extraction performance. In our work, we find that certain domain constraints (e.g., post-date) can be used to design better similarity measures capable of circumventing the influence of UGC. In addition, we use anchor points provided by the domain constraints to improve the extraction process, which results in an algorithm called MiBAT (Mining data records Based on Anchor Trees). We conduct extensive experiments on a dataset of forum thread pages collected from 307 sites that cover 219 different forum software packages. Our approach achieves a precision of 98.9% and a recall of 97.3% with respect to post record extraction. At the page level, it perfectly handles 91.7% of pages, without extracting any wrong posts or missing any gold-standard posts. We also apply our approach to comment extraction and achieve good results as well.
Automatic extraction of web data records containing user-generated content∗
Xinying Song†, Jing Liu†, Yunbo Cao‡, Chin-Yew Lin‡, and Hsiao-Wuen Hon‡
† Harbin Institute of Technology, Harbin 150001, P.R.China
‡ Microsoft Research Asia, Beijing 100190, P.R.China
xysong@mtlab.hit.edu.cn, jliu@ir.hit.edu.cn, {yunbo.cao, cyl, hon}@microsoft.com
Categories and Subject Descriptors
H.3.m [Information Storage and Retrieval]: Miscellaneous - Data Extraction; Web
General Terms
Algorithms, Performance, Experimentation
Keywords
User-generated content, information extraction, structured
data
∗This work was done when Xinying Song and Jing Liu were
visiting students at Microsoft Research Asia.
CIKM’10, October 26–30, 2010, Toronto, Ontario, Canada.
Copyright 2010 ACM 978-1-4503-0099-5/10/10 ...$10.00.
1. INTRODUCTION
Web 2.0, the class of web applications that encourage user participation, is now a well-known concept and continues to grow in popularity. Along with this growth, an enormous amount of valuable knowledge and information, which we call user-generated content (UGC), has accumulated over the years and keeps growing. Extracting this valuable web data in an automatic and scalable manner can benefit many applications, such as question answering [22], blog or review mining [10], and expert search in web communities.
Typically, web pages generated by Web 2.0 applications contain a large amount of UGC, such as forum posts, blogs, reviews, and comments. According to Wikipedia, UGC refers to “various kinds of media content, publicly available, that are produced by end-users”, and thus is highly diverse in both content and format. In this paper we focus on tackling the complexity of extracting web data records containing UGC. Hereafter, for ease of presentation, we use the extraction of posts from web forums (as shown in Fig. 1) as our primary example, although our approach can be applied to other types of applications as well.
Figure 1: A typical web forum thread page, showing
two posts and one embedded advertisement bar
Web data extraction has been a hot research topic [4] in recent years. Recent work mainly follows two categories of approaches: semi-automatic and fully automatic. Semi-automatic approaches require manually labeled data for either learning wrappers based on a tree-structured template [6, 8, 25, 26], or training supervised statistical models on a specific domain [20, 27]. Due to the laborious nature of labeling, such semi-automatic approaches do not scale to web-scale data extraction.
In contrast, fully automatic approaches do not require any labeled data. Such approaches mainly study two sub-categories of problems: (1) extracting a list of data objects (records) from a single page and (2) learning a template from multiple pages of the same type [2, 7]. The problem we study in this paper falls into the first sub-category. One of the representative approaches is MDR (Mining Data Records in Web pages) [12, 13], together with its extensions [14, 17, 23]. We develop our own approach on the basis of MDR. MDR identifies a list of records by testing the similarity of two sub-trees in the DOM tree of a web page against a pre-defined threshold. Such a method is referred to as the similarity-based approach [15], because the underlying assumption is that data records belonging to the same list usually have similar DOM tree structures.
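As an illustration of the similarity test described above, the following is a minimal sketch only. It assumes a simple top-down tree matching score normalised by the average tree size; the names `Node`, `tree_match`, `similarity`, and `is_similar`, the normalisation, and the threshold values are all hypothetical and are not taken from MDR or from this paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A DOM element reduced to its tag name and ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the sub-tree rooted at `node`."""
    return 1 + sum(size(c) for c in node.children)


def tree_match(a: Node, b: Node) -> int:
    """Size of the largest top-down, order-preserving node mapping."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # best[i][j]: best score using the first i children of a and first j of b
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = best[i - 1][j - 1] + tree_match(a.children[i - 1],
                                                     b.children[j - 1])
            best[i][j] = max(best[i - 1][j], best[i][j - 1], paired)
    return best[m][n] + 1  # +1 for the matched roots


def similarity(a: Node, b: Node) -> float:
    """Matched nodes divided by the average tree size (assumed normalisation)."""
    return tree_match(a, b) / ((size(a) + size(b)) / 2)


def is_similar(a: Node, b: Node, threshold: float = 0.6) -> bool:
    """Similarity test against a pre-defined threshold (value is illustrative)."""
    return similarity(a, b) >= threshold


# Two toy records sharing one template; post_b carries extra free-format markup.
post_a = Node("div", [Node("span"), Node("div", [Node("p")])])
post_b = Node("div", [
    Node("span"),
    Node("div", [Node("p"),
                 Node("ul", [Node("li"), Node("li")]),
                 Node("blockquote")]),
])
```

In this toy case the two records share the same template, yet the extra markup inside `post_b` alone lowers the score from 1.0 to about 0.67, so whether the pair passes the test depends entirely on where the threshold is set.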
Web data records containing UGC usually consist of two parts: well-formatted structured data (e.g., author and publication date), referred to as the template part, and free-format unstructured UGC. Because of the UGC, the similarity values between data records may vary greatly, which makes it impractical to set a good and robust similarity threshold and thus causes the similarity-based approach to fail. Fig. 2 shows the tree alignment for the two posts in Fig. 1. We can see that the two records look dissimilar because of the large portion of UGC. Intuitively, the problem can be solved if we can differentiate the structured template from the unstructured UGC on DOM trees and use only the template part to perform the similarity test. However, it is not easy to make such a differentiation in an accurate and robust way.
Figure 2: Tree match of two posts (gray triangles
denote UGC while gray rectangles denote post-date)
Inspired by domain-dependent work [20, 27], we find that some domain-dependent constraints help detect the appropriate part of the tree for the similarity test. For example, for extracting posts from web forums, a good and intuitive constraint is the post-date (the publication date of a post), because it is part of the structured data, occurs in every post, and can be easily identified (Fig. 2). Motivated by this intuition, we propose two similarity measures to overcome the difficulty caused by UGC. Note that, in addition to forums, almost all types of web data records containing UGC have a post-date, such as blogs, user comments (e.g., Twitter, Flickr, YouTube, Digg), and reviews (e.g., Amazon). Therefore, our proposal is not restricted to forum sites.
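To make concrete why a post-date can be easily identified, here is a minimal sketch of a text-level detector. The patterns, the `max_len` cut-off, and the function name `looks_like_post_date` are hypothetical; real forum pages use many more date formats and languages than the four assumed here.

```python
import re

# Hypothetical patterns covering a few common English date formats only.
MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*"
DATE_PATTERNS = [
    re.compile(r"\b\d{4}-\d{1,2}-\d{1,2}\b"),            # 2010-10-26
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),          # 10/26/2010
    re.compile(rf"\b{MONTH}\.? \d{{1,2}}, \d{{4}}\b"),    # Oct 26, 2010
    re.compile(rf"\b\d{{1,2}} {MONTH}\.? \d{{4}}\b"),     # 26 October 2010
]


def looks_like_post_date(text: str, max_len: int = 40) -> bool:
    """Return True for a short text node that contains a date-like string.

    The length cut-off is an assumed heuristic: it keeps a date mentioned
    inside a long piece of user-written text from being mistaken for the
    structured post-date field.
    """
    text = text.strip()
    if not text or len(text) > max_len:
        return False
    return any(p.search(text) for p in DATE_PATTERNS)
```

Under these assumptions, a short node such as "Posted: Oct 26, 2010" is accepted, while a long sentence that merely mentions a date is rejected by the length cut-off.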
Domain constraints also provide strong anchor point information for data record detection. For example, each forum post must contain exactly one sub-tree containing the post-date. We propose a novel data record extraction algorithm inspired by this intuition.
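The "exactly one post-date per record" constraint can be sketched as a simple filter over sibling sub-trees. This is an illustration of the anchor point intuition only, not the MiBAT algorithm; the `Node` class, the `is_anchor` predicate, and the toy page are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """A DOM element with its tag, own text, and ordered children."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def count_anchors(node: Node, is_anchor: Callable[[Node], bool]) -> int:
    """Count anchor nodes (e.g. post-date nodes) in the sub-tree of `node`."""
    own = 1 if is_anchor(node) else 0
    return own + sum(count_anchors(c, is_anchor) for c in node.children)


def candidate_records(parent: Node,
                      is_anchor: Callable[[Node], bool]) -> List[Node]:
    """Children of `parent` whose sub-tree holds exactly one anchor."""
    return [c for c in parent.children if count_anchors(c, is_anchor) == 1]


# Toy thread page: two posts separated by an advertisement bar.
thread = Node("div", children=[
    Node("div", children=[Node("span", "Posted: 2010-10-26"), Node("p", "Hi")]),
    Node("div", children=[Node("img")]),  # advertisement, no post-date
    Node("div", children=[Node("span", "Posted: 2010-10-27"), Node("p", "Re")]),
])

# Assumed anchor test for the toy page; a real detector would be more general.
records = candidate_records(
    thread, lambda n: n.tag == "span" and n.text.startswith("Posted:"))
```

Here the advertisement bar is skipped because it contains no anchor, so the two selected records are not adjacent siblings.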
In summary, in this paper we aim to solve the problem of extracting from a single page a list of web data records that contain UGC in a fully automatic way. Previous work on this topic usually focuses on data objects containing no UGC, for example product lists [13, 23], search engine results [5, 17, 24], or DBLP literature reference records [15]. None of them explicitly claims to handle the UGC part. Yang et al. [20] work on forum data extraction, but in a semi-automatic way. Our contributions are as follows:
• We formulate similarity measures and propose incorporating domain constraints into their design, on the basis of which an MDR-like similarity-based approach can overcome the similarity test issue caused by UGC (Sec. 4).
• We propose a novel mining algorithm called MiBAT, which makes use of domain constraints to acquire anchor point information. Compared to MDR, MiBAT can not only extract non-consecutive data records but also overcome MDR’s greedy deficiency [15] (Sec. 5).
• We develop a dataset collected from 307 forum sites powered by 219 different forum software packages, on which our method achieves 98.9% precision and 97.3% recall (Sec. 6.1). To the best of our knowledge, this is the most comprehensive evaluation of forum post extraction.
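For readers unfamiliar with the two metrics quoted in the last contribution, the sketch below computes record-level precision and recall, plus a page-level "perfect" check, under the standard definitions. How an extracted record is matched to a gold-standard post is an assumption here (exact identifier equality), and the toy data are made up for illustration.

```python
from typing import Set, Tuple


def precision_recall(extracted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Record-level precision and recall over sets of post identifiers."""
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def is_perfect_page(extracted: Set[str], gold: Set[str]) -> bool:
    """A page is handled perfectly when no post is wrong and none is missed."""
    return extracted == gold


# Toy page: one wrongly extracted block ("ad") and one missed post ("p4").
extracted = {"p1", "p2", "p3", "ad"}
gold = {"p1", "p2", "p3", "p4"}
p, r = precision_recall(extracted, gold)
```

On this toy page both precision and recall are 0.75, and the page does not count as perfectly handled.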
2. RELATED WORK
Web data extraction has been an extensively studied research topic in recent years, resulting in a rich variety of approaches. We discuss highly relevant work here and refer readers to a survey [4] for further study.
Early work on automatically extracting data records from a single page, including [9] and OMINI [3], employs a set of heuristic rules to identify data record boundaries. Later work is based on mining repetitive patterns from HTML tag sequences, such as IEPAD [5] and Dela [19]. Recent work is based on mining similar sub-trees in the DOM tree of the web page, represented by MDR [13]. It is reported in [13] that MDR outperforms both OMINI and IEPAD.
Due to its simplicity and effectiveness, MDR has attracted wide research interest and has been extended in many studies. One direction of improvement is incorporating visual layout information [17, 23, 24]. However, visual features usually require proper rendering with additional resources (such as CSS files), and thus are not always available or generally helpful. Our work in this paper is purely based on the DOM tree structure, without incorporating any visual features. We will show by experimental results that such a pure tag-tree based approach achieves satisfactory performance as well.
Web data can be a relation of k-tuples (where each record has k attributes), or a complex object with a hierarchical structure like nested lists [4]. The former is called flat and the latter nested. In this paper we mainly focus on flat data, for web data records containing UGC are usually displayed


