Passage Retrieval for Information Extraction using Distant Supervision
Abstract
In this paper, we propose a keyword-based passage retrieval algorithm for information extraction, trained by distant supervision. Our goal is to extract attributes of people and organizations more quickly and accurately by first ranking all the potentially relevant passages according to their likelihood of containing the answer and then performing a traditional deeper, slower analysis of individual passages. Using Freebase as our source of known relation instances and Wikipedia as our text source, we collect a weighted set of keywords indicative of each relation and then use it to re-rank the passages retrieved by the Lemur search engine. Experiments show that our algorithm significantly outperforms state-of-the-art passage retrieval techniques in evaluations of both individual passage retrieval and end-to-end information extraction.
Chiang Mai, Thailand, November 8–13, 2011. © 2011 AFNLP
Wei Xu° Ralph Grishman° Le Zhao*
°New York University
New York, NY, USA
{xuwei,grishman}@cs.nyu.edu
*Carnegie Mellon University
Pittsburgh, PA, USA
lezhao@cs.cmu.edu
1 Introduction
Large-corpus information extraction involves the
extraction of pre-specified types of relations and
events from large corpora. For example, the
Knowledge Base Population (KBP) slot-filling
task (Ji et al., 2010) involves finding, from a
large corpus, a few dozen attributes of a specified
person or organization.
In many cases we do not have the time to perform
in-depth extraction for all attributes over
the entire corpus. Consequently, addressing this
task typically involves a blend of traditional
question answering (QA) and information extraction
(IE) methods. As in QA, we need to begin
with passage retrieval, where a passage can range
from a single sentence to a longer span of text or
a whole document. However, unlike QA, we have a
fixed inventory of relations and a fixed set of
expected answer types (e.g., employer of a person).
This allows us to bring to bear the more specialized
learning methods of IE to tune the passage retrieval
for each relation of interest.
To the best of our knowledge, we are the first
to systematically study passage retrieval
algorithms for information extraction and to propose
a novel distant supervision approach to obtain a
list of weighted keywords for each relation. Distant
supervision (Mintz et al., 2009) makes use of
noisy training data generated automatically from
a related but different type of dataset to solve
problems on another type of data. Instead of a
handful of human-selected keywords, we automatically
learn hundreds or thousands of indicative
keywords from a freely available online resource,
Freebase, which is similar to Wikipedia
Infoboxes. Passages are ranked and retrieved
based on these keywords, which are indicative of
particular relations. We then feed individual passages
to a traditional IE system or to an answer extraction
component, as used in QA systems, to obtain the
final outputs. Both the training and testing procedures
of our method require only statistics of
surface words and named entities in the text and
are thus time-efficient.
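The ranking step described above can be illustrated with a minimal sketch. It assumes that the keywords for one relation have already been learned and stored as a word-to-weight mapping. The function name `rerank_passages`, the whitespace tokenizer, the additive score, and the example weights are all hypothetical choices made for illustration; they are not the scoring function of this paper.

```python
from typing import Dict, List, Tuple


def rerank_passages(passages: List[str],
                    keyword_weights: Dict[str, float]) -> List[Tuple[float, str]]:
    """Score each passage by the summed weight of the distinct keywords it contains."""
    scored = []
    for passage in passages:
        # Naive tokenization (an assumption): lowercase, split on whitespace,
        # and strip surrounding punctuation.
        tokens = {tok.strip(".,;:!?\"'()") for tok in passage.lower().split()}
        score = sum(w for kw, w in keyword_weights.items() if kw in tokens)
        scored.append((score, passage))
    # Highest score first; the sort is stable, so ties keep the retrieval order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical weights for an employee_of-style relation.
weights = {"works": 1.5, "joined": 1.2, "ceo": 2.0}
candidates = [
    "Smith was born in Ohio.",
    "Smith joined Acme as CEO in 2005.",
    "Smith works at Acme.",
]
ranked = rerank_passages(candidates, weights)
for score, passage in ranked:
    print(f"{score:.1f}  {passage}")
```

Because the score depends only on surface words, it can be computed without parsing, which is in line with the efficiency argument made above.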
This paper addresses the following questions:
1) How can we tune passage retrieval for a
particular relation?
2) How do distant learning methods apply to
the passage retrieval task?
3) How much do these methods improve over
typical QA passage retrieval?
We will measure the improvement in two
ways:
1) the ability to find relevant passages, measured
by the reduction in the number of passages the
system must examine and the increase in the
proportion of relevant passages among the
top-ranked ones;
2) the improvement in end-to-end
information extraction obtained by taking passage
relevance into account.
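The passage-level measures mentioned above can be made concrete with a small sketch. It assumes that each ranked passage carries a boolean relevance judgment. The two functions below are standard IR-style measures (precision at k and the rank of the first relevant passage) written under hypothetical names; they are not necessarily the exact metrics reported in the experiments.

```python
from typing import List, Optional


def precision_at_k(relevance: List[bool], k: int) -> float:
    """Proportion of relevant passages among the top k ranked ones."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevance[:k]) / k


def passages_until_first_hit(relevance: List[bool]) -> Optional[int]:
    """1-based rank of the first relevant passage, or None if there is none."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return rank
    return None


# Hypothetical relevance judgments for the same five passages under two rankings.
baseline = [False, False, True, False, True]
reranked = [True, True, False, False, False]

print(precision_at_k(baseline, 2), precision_at_k(reranked, 2))
print(passages_until_first_hit(baseline), passages_until_first_hit(reranked))
```

A better ranking raises the first measure and lowers the second, so a downstream extractor has fewer passages to analyze before it reaches an answer.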
2 Previous Work
Relatively little work has been done to investigate
in detail the quality of the IR for large-corpus
IE or to take advantage of the more constrained
relations of interest compared to traditional
QA. The Knowledge Base Population
(KBP) track at TAC 2010 (Ji et al., 2010) evaluates
the ability of automated systems to discover
information about named entities. Its slot-filling
task is to find answers to queries asking for a few
dozen attributes of a specified entity, such as the
'employee_of' attribute of a given PERSON entity.
We refer to the given entity as the target
entity, and to the attributes of entities as slot types.
In past KBP competitions, many participants (Li
et al., 2009; Byrne and Dunnion, 2010; Chen et
al., 2010) exploited a QA system to fill slots by
constructing queries based on target entities and
slot types. However, their query templates contain
only a few query terms beyond
the target entity name, and these terms are mostly
obtained manually.
Most QA systems use the question words
as-is or with expansion to form the retrieval system
query. Various query expansion approaches
have been used to tackle the passage-query mismatch
problem, including relevance feedback
(Derczynski et al., 2008), ontologies (Bhogal et
al., 2007), and semantic lexica (Ofoghi et al., 2006).
As a data-driven approach, relevance feedback
is sensitive to the quality of the initial retrieval.
Our use of Freebase, a freely available
large semantic database, to provide distant supervision
requires neither labeled data nor knowledge
models that are costly to construct.
Some researchers (Grishman and Min, 2010;
Chrupala et al., 2010; Surdeanu et al., 2010) have
integrated IR and IE. Surdeanu et al.
(2010) coupled the entity name with a handful of
hand-selected trigger words for each slot type as
queries to an IR system in an effort to boost the
ranking of sentences likely to contain the relations
of interest. Chrupala et al. (2010) proposed
one of the most customized passage retrieval
components for large-corpus IE. Besides the target
named entity, they take into account the type
of the expected named entity (such as ORGANIZATION
for the 'employee_of' relation) and expand
queries with predefined words that are predictive
of specific slot types (such as 'work' for the
'employee_of' relation). Relevant work is also
emerging from the IR community in the
Related Entity Finding (REF) task of the TREC
Entity Track (Balog et al., 2010), which is to return
a ranked list of related entities given an expected
type of entity and a brief free-text description
(query) of the relation. Fang et al. (2010)
ranked entities by their relevance to the query at
the document, passage and entity levels, primarily
based on the similarity between terms.
In all this previous work, the limited number
of query terms has been the performance bottleneck
of passage retrieval for large-corpus
information extraction.
Perhaps most similar to our distant supervision
keyword learning approach for passage retrieval
is the semi-automatic method of Nguyen et al.
(2007), who extract only a few keywords for
each relation from Wikipedia and study only the
dependency subtrees that contain those keywords.
In contrast to their tf-idf model followed
by a manual selection step, our algorithm
fully automatically extracts hundreds or
even thousands of keywords, each with a weight
indicating its relevance to a relation.
Mintz et al. (2009) proposed a distant supervision
approach for relation extraction using a
feature-rich logistic regression model. Like us, they
used Freebase as a source of known relation instances
and Wikipedia as a text source to create
noisy training data, and tested on Wikipedia
data. Our approach differs from theirs in several
ways. First, our main concern is the speed required
for large-corpus IE, which we address by reducing the
amount of text to process through passage retrieval,
while they use deep NLP features such as parsing
and process the whole corpus. Second, we ensure
the quality of the output by using a supervised
information extraction system trained on gold-standard
data, while their performance is constrained by
noisy training data. Third, we evaluate on a corpus
that consists of news and web data, while
they test on Wikipedia data that is from the same
source as the training data. We show that our
method adapts to new domains because it is
based on lexical statistics and is thus tolerant of
noise in the training data.
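The general idea of creating noisy training data from known relation instances can be sketched as follows. This is a simplified illustration of distant supervision in general, not the pipeline of Mintz et al. (2009) or of this paper: the triple format, the function name `label_by_distant_supervision`, the substring matching, and the toy sentences are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def label_by_distant_supervision(sentences: List[str],
                                 triples: List[Triple]) -> Dict[str, List[str]]:
    """Assign a sentence to a relation when it mentions both arguments of a known instance."""
    labeled = defaultdict(list)
    for sentence in sentences:
        for subj, relation, obj in triples:
            # Plain substring matching (an assumption); a real system would
            # match recognized named-entity mentions instead.
            if subj in sentence and obj in sentence:
                labeled[relation].append(sentence)
    return dict(labeled)


# Hypothetical knowledge-base instances and sentences.
kb = [
    ("Alice Smith", "employee_of", "Acme Corp"),
    ("Alice Smith", "place_of_birth", "Springfield"),
]
text = [
    "Alice Smith joined Acme Corp as an engineer.",
    "Alice Smith was born in Springfield.",
    "Alice Smith opened the Acme Corp office in Springfield.",
]
noisy = label_by_distant_supervision(text, kb)
for relation, examples in noisy.items():
    print(relation, len(examples))
```

The third sentence is assigned to both relations even though it does not state a birthplace, which shows the kind of noise that such automatically generated training data contains.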
3 Freebase and Wikipedia
Freebase (http://www.freebase.com) is a freely available online database of
structured knowledge. It collects information
about approximately 20 million entities (such as




