Appears in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL), 2012.

Big Data versus the Crowd: Looking for Relationships in All the Right Places

Ce Zhang, Feng Niu, Christopher Ré, Jude Shavlik
Department of Computer Sciences
University of Wisconsin-Madison, USA
{czhang,leonn,chrisre,shavlik}@cs.wisc.edu

Abstract

Classically, training relation extractors relies on high-quality, manually annotated training data, which can be expensive to obtain. To mitigate this cost, NLU researchers have considered two newly available sources of less expensive (but potentially lower quality) labeled data from distant supervision and crowd sourcing. There is, however, no study comparing the relative impact of these two sources on the precision and recall of post-learning answers. To fill this gap, we empirically study how state-of-the-art techniques are affected by scaling these two sources. We use corpus sizes of up to 100 million documents and tens of thousands of crowd-source labeled examples. Our experiments show that increasing the corpus size for distant supervision has a statistically significant, positive impact on quality (F1 score). In contrast, human feedback has a positive and statistically significant, but lower, impact on precision and recall.

1 Introduction

Relation extraction is the problem of populating a target relation (representing an entity-level relationship or attribute) with facts extracted from natural-language text. Sample relations include people’s titles, birth places, and marriage relationships.

Traditional relation-extraction systems rely on manual annotations or domain-specific rules provided by experts, both of which are scarce resources that are not portable across domains. To remedy these problems, recent years have seen interest in the distant supervision approach for relation extraction (Wu and Weld, 2007; Mintz et al., 2009).
The input to distant supervision is a set of seed facts for the target relation together with an (unlabeled) text corpus, and the output is a set of (noisy) annotations that can be used by any machine learning technique to train a statistical model for the target relation. For example, given the target relation birthPlace(person, place) and a seed fact birthPlace(John, Springfield), the sentence “John and his wife were born in Springfield in 1946” (S1) would qualify as a positive training example.

Distant supervision replaces the expensive process of manually acquiring annotations that is required by direct supervision with resources that already exist in many scenarios (seed facts and a text corpus). On the other hand, distantly labeled data may not be as accurate as manual annotations. For example, “John left Springfield when he was 16” (S2) would also be considered a positive example about place of birth by distant supervision, as it contains both John and Springfield. The hypothesis is that the broad coverage and high redundancy in a large corpus would compensate for this noise. For example, with a large enough corpus, a distant supervision system may find that patterns in the sentence S1 strongly correlate with seed facts of birthPlace, whereas patterns in S2 do not qualify as a strong indicator. Thus, intuitively, the quality of distant supervision should improve as we use larger corpora. However, there has been no study on the impact of corpus size on distant supervision for relation extraction. Our goal is to fill this gap.

Besides “big data,” another resource that may be valuable to distant supervision is crowdsourcing. For example, one could employ crowd workers to provide feedback on whether distant supervision examples are correct or not (Gormley et al., 2010). Intuitively, the crowd workforce is a perfect fit for such tasks, since many erroneous distant labels could be easily identified and corrected by humans. For example, distant supervision may mistakenly consider “Obama took a vacation in Hawaii” a positive example for birthPlace simply because a database says that Obama was born in Hawaii; a crowd worker would correctly point out that this sentence is not actually indicative of this relation.

It is unclear, however, which strategy one should use: scaling the text corpus or the amount of human feedback. Our primary contribution is to empirically assess how scaling these inputs to distant supervision impacts its result quality. We study this question with input data sets that are orders of magnitude larger than those in prior work. While the largest corpus (Wikipedia and New York Times) employed by recent work on distant supervision (Mintz et al., 2009; Yao et al., 2010; Hoffmann et al., 2011) contains about 2M documents, we run experiments on a 100M-document (50X more) corpus drawn from ClueWeb.¹ While prior work (Gormley et al., 2010) on crowdsourcing for distant supervision used thousands of human feedback units, we acquire tens of thousands of human-provided labels. Despite the large scale, we follow state-of-the-art distant supervision approaches and use deep linguistic features, e.g., part-of-speech tags and dependency parsing.²

Our experiments shed light on the following two questions:

1. How does increasing the corpus size impact the quality of distant supervision?
2. For a given corpus size, how does increasing the amount of human feedback impact the quality of distant supervision?
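
The distant-labeling step and the crowd-feedback step described earlier in this section can be illustrated with a minimal sketch. It is not the system used in this paper: the toy sentences, the function names (`distant_label`, `between_pattern`, `apply_crowd_feedback`), and the use of the words between two mentions as a "pattern" are all assumptions made for illustration, whereas the actual experiments rely on deep linguistic features such as dependency parses.

```python
from collections import Counter

# Hypothetical seed facts for birthPlace(person, place); toy data for illustration.
SEED_FACTS = {("John", "Springfield")}

SENTENCES = [
    "John and his wife were born in Springfield in 1946",  # S1
    "John left Springfield when he was 16",                # S2
    "Mary was born in Shelbyville",                        # matches no seed fact
]


def distant_label(sentences, seed_facts):
    """Mark a sentence as a (noisy) positive example if it mentions
    both arguments of some seed fact."""
    examples = []
    for sentence in sentences:
        tokens = sentence.split()
        for person, place in seed_facts:
            if person in tokens and place in tokens:
                examples.append((sentence, person, place))
    return examples


def between_pattern(sentence, person, place):
    """Return the words between the two mentions as a crude surface pattern
    (an assumed stand-in for dependency-path features)."""
    tokens = sentence.split()
    i, j = tokens.index(person), tokens.index(place)
    lo, hi = min(i, j), max(i, j)
    return " ".join(tokens[lo + 1:hi])


def apply_crowd_feedback(examples, rejected):
    """Drop examples whose sentence crowd workers flagged as not
    expressing the target relation."""
    return [ex for ex in examples if ex[0] not in rejected]


examples = distant_label(SENTENCES, SEED_FACTS)
patterns = Counter(between_pattern(*ex) for ex in examples)
cleaned = apply_crowd_feedback(
    examples, rejected={"John left Springfield when he was 16"}
)
```

Both S1 and S2 are labeled positive because each contains John and Springfield, which is exactly the noise discussed above. On a larger corpus, pattern counts accumulated over many seed facts would be expected to favor patterns like the one in S1 over the one in S2, while crowd feedback removes individual wrong labels such as S2 directly.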

We found that increasing corpus size consistently and significantly improves recall and F1, despite reducing precision on small corpora; in contrast, human feedback has a relatively small impact on precision and recall. For example, on a TAC corpus with 1.8M documents, we found that increasing the corpus size ten-fold consistently results in statistically significant improvement in F1 on two standardized relation extraction metrics (t-test with p=0.05). On the other hand, increasing the amount of human feedback ten-fold results in statistically significant improvement in F1 only when the corpus contains at least 1M documents, and the magnitude of this improvement was only one fifth of that obtained by increasing the corpus size.

We find that the quality of distant supervision tends to be recall gated; that is, for any given relation, distant supervision fails to find all possible linguistic signals that indicate a relation. By expanding the corpus, one can expand the number of patterns that occur with a known set of entities. Thus, as a rule of thumb for developing distant supervision systems, one should first attempt to expand the training corpus and then worry about precision of labels only after having obtained a broad-coverage corpus.

Throughout this paper, it is important to understand the difference between mentions and entities. Entities are conceptual objects that exist in the world (e.g., Barack Obama), whereas mentions are the varied wordings that authors use to refer to entities in text (Ji et al., 2010).

¹ http://lemurproject.org/clueweb09.php/
² We used 100K CPU hours to run such tools on ClueWeb.

2 Related Work

The idea of using entity-level structured data (e.g., facts in a database) to generate mention-level training data (e.g., in English text) is a classic one: researchers have used variants of this idea to extract entities of a certain type from webpages (Hearst, 1992; Brin, 1999).
More closely related to relation extraction is the work of Lin and Patel (2001) that uses dependency paths to find answers that express the same relation as in a question.

Since Mintz et al. (2009) coined the name “distant supervision,” there has been growing interest in this technique. For example, distant supervision has been used for the TAC-KBP slot-filling tasks (Surdeanu et al., 2010) and other relation-extraction tasks (Hoffmann et al., 2010; Carlson et al., 2010; Nguyen and Moschitti, 2011a; Nguyen and Moschitti, 2011b). In contrast, we study how increasing input size (and incorporating human feedback) improves the result quality of distant supervision. We focus on logistic regression, but it is interesting future work to study more sophisticated prob-