Supervised Topical Key Phrase Extraction of News Stories using Crowdsourcing, Light Filtering and Co-reference Normalization

by Luís Marujo, Anatole Gershman, Jaime Carbonell, Robert Frederking, João P Neto
LREC 2012

Abstract

Fast and effective automated indexing is critical for search and personalized services. Key phrases that consist of one or more words and represent the main concepts of the document are often used for the purpose of indexing. In this paper, we investigate the use of additional semantic features and pre-processing steps to improve automatic key phrase extraction. These features include the use of signal words and Freebase categories. Some of these features lead to significant improvements in the accuracy of the results. We also experimented with two forms of document pre-processing that we call light filtering and co-reference normalization. Light filtering removes sentences that are judged peripheral to the document's main content. Co-reference normalization unifies several written forms of the same named entity into a unique form. We also needed a "Gold Standard", that is, a set of labeled documents for training and evaluation. While the subjective nature of key phrase selection precludes a true "Gold Standard", we used Amazon's Mechanical Turk service to obtain a useful approximation. Our data indicates that the biggest improvements in performance were due to shallow semantic features, news categories, and rhetorical signals (nDCG 78.47% vs. 68.93%). The inclusion of deeper semantic features such as Freebase sub-categories was not beneficial by itself but, in combination with pre-processing, did cause slight improvements in the nDCG scores.

Available from www.lrec-conf.org

Luís Marujo¹,², Anatole Gershman¹, Jaime Carbonell¹, Robert Frederking¹, João P. Neto²
¹ LTI/CMU, USA
² INESC-IST, Portugal
Luis.Marujo@inesc-id.pt, anatoleg@cs.cmu.edu, jgc@cs.cmu.edu, ref@cs.cmu.edu, Joao.Neto@inesc-id.pt

Keywords: Automatic Key Phrase Extraction, Semantic Features, Pre-Processing

1. Introduction

In the last decade, the news consumption paradigm has shifted from traditional physical newspapers to personalized online news aggregation systems, such as News360, Google News, and Yahoo! News. These systems collect large amounts of news from various sources and provide an aggregate view of news on their websites and mobile applications. Fast and effective automated indexing is a critical problem for such services. Key phrases that consist of one or more words and represent the main concepts of the document are often used for the purpose of indexing. The precision and F1 measure of current state-of-the-art automatic key phrase extraction (AKE) systems are in the 30-50% range (Marujo et al., 2011; Medelyan et al., 2011; Witten et al., 1999). This makes improvements in AKE an urgent problem.

In this work, we followed a fairly traditional approach of training a classifier to select an ordered list of the most likely candidates for key phrases in a given document. The main novelty of the paper is the use of additional semantic features and pre-processing steps. We tested several features that, to the best of our knowledge, have not been used for this purpose. These features include signal words, Freebase categories, etc. Some of these features lead to significant improvements in the accuracy of the results.

We also experimented with two forms of document pre-processing that we call light filtering and co-reference normalization. Light filtering removes sentences that are judged peripheral to the document's main content. Co-reference normalization unifies several written forms of the same named entity into a unique form. In our experiments, both light filtering and co-reference normalization led to small but noticeable improvements in the resulting accuracy of key phrase extraction.

We also needed a set of "Gold Standard" (GS) labeled documents for training and evaluation. We used Amazon's Mechanical Turk¹ (MTurk) service to obtain these.
In this paper, we report our experiments with crowdsourcing for key phrase extraction and the results of our experiments with two new pre-processing steps and new features. The paper is organized as follows: Section 2 presents the pre-processing steps; Section 3 describes the new features explored; Section 4 describes the creation of a GS dataset using crowdsourcing; Section 5 details how the experiments were performed and their results; and Section 6 contains conclusions and suggestions for future work.

2. Pre-Processing

Light Filtering: our previous experiment with Portuguese-language broadcast news indicated that the elimination of about 10% of low-relevance sentences from the body of a news transcript results in a 2% improvement in AKE precision and recall. We hypothesized that similar improvements may be achieved in English-language news articles. We call this process light filtering. It is based on assigning a measure of relevance to each sentence of the article using centrality-as-relevance methods (Ribeiro et al., 2011). Centrality-as-relevance calculates pair-wise distances between sentences and finds a centroid for the article. The K sentences closest to the centroid are called the support set (SS). The distance between a sentence and the support set is used as a measure of this sentence's relevance. Based on our previous experiments, we used 5 support sentences per document and removed the most distant 10% of sentences from all documents using the Euclidean distance, where $x$ and $y$ are vector representations of sentences and $N$ is the length in words of the longest sentence:

$$d_{euclidean}(x, y) = \|x - y\| = \sqrt{\sum_{i=1}^{N} |x_i - y_i|^2}$$

Co-reference Normalization: for stylistic reasons, journalists often use different forms of reference to the same named entity. For example, they might refer to Michael Jackson as Jackson or Michael. We hypothesized that normalizing such references would improve AKE performance. We used ENCORE (Shah et al., 2011), a semi-supervised ensemble co-reference resolution system, to identify multiple forms of the same named entity and to normalize them into a single form (e.g., Michael Jackson).

3. Features

Typically, classifier-based automatic key phrase extraction systems include features such as TF-IDF (Salton et al., 1975):

$$\mathrm{tf\text{-}idf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$$

$$\mathrm{idf}(t, D) = \log \frac{|D|}{1 + |\{d \in D : t \in d\}|}$$

where

• tf(t, d) is the number of occurrences of term or phrase t in document d
• |D| is the number of documents in the corpus
• |{d ∈ D : t ∈ d}| is the number of documents containing term or phrase t

Other features use position on the page (Witten et al., 1999), number of words in the phrase (Medelyan et al., 2011), part-of-speech tags (Marujo et al., 2011), etc. We decided to test two additional kinds of features: semantic and rhetorical. We used three levels of semantic features: shallow semantic features, top-categories and sub-categories. The shallow semantic features consist of five dimensions:

1. the number of characters in a phrase - empirically, long nouns tend to be relevant,
2. the number of named entities - very often named entities are important key phrases; typically this number is 0, 1, or 2,
3. the number of capital letters - the identification of acronyms is the main reason to include this feature,
4. the Part-of-Speech (POS) pattern of the phrase (e.g., noun; adj, noun; adj, adj, noun; etc.) - nouns and noun phrases are the most common pattern observed in key phrases, verbs and verb phrases are less frequent, and key phrases made of the remaining POS tags are rare; we assign a distinct integer to each pattern,
5. the frequency of the phrase in the LDC HUB4 dataset² - to be precise, we use the corresponding entry of a 4-gram model created using the dataset. The model was compressed using the Minimal Perfect Hash method (Guthrie et al., 2010) to reduce both memory consumption and access times. We used the smooth-nlp toolkit³ to compress the model.

The top-categories we used are: Technology, Crime, Sports, Health, Art and Culture, Fashion, Science, Business, World Politics, and U.S. Politics. We also used 85 sub-categories taken from the Freebase domain names⁴. These included American Football, Baseball, Book, Exhibitions, Education, Engineering, Music, etc. Both the top-categories and the sub-categories are used as binary features of a phrase. The top-category of each phrase is obtained from the document source category, and the sub-categories are extracted by looking up the phrase in a Freebase dump.

Authors of news articles use various rhetorical devices to direct the reader's attention. The following eleven types of signals have been identified in the literature (Jarvelin et al., 2000):

1. Continuation - there are more ideas to come, e.g.: moreover, furthermore, in addition, another.
2. Change of direction - there is a change of topic, e.g.: in spite of, nevertheless, the opposite, on the contrary.
3. Sequence - there is an order in the presentation of ideas, e.g.: in first place, next, into (far into the night).
4. Illustration - gives an example, e.g.: to illustrate, in the same way as, for instance, for example.
5. Emphasis - increases the relevance of an idea; these are the most important signals, e.g.: it all boils down to, the most substantial issue, should be noted, the crux of the matter, more than anything else.
6. Cause, condition, or result - a condition or modification applies to the following idea, e.g.: if, because, resulting from.
7. Spatial signals - denote locations, e.g.: in front of, between, adjacent, west, east, north, south, beyond.
8. Comparison/contrast - comparison of two ideas, e.g.: analogous to, better, less than, less, like, either.
9. Conclusion - ends the presentation of an idea and may have special importance, e.g.: in summary, from this we see, last of all, hence, finally.
10. Fuzz - there is an idea that is not clear, e.g.: looks like, seems like, alleged, maybe, probably, sort of.
11. Non-word emphasis, e.g.: exclamation point (!), "quotation marks".

¹ https://www.mturk.com/
² http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2000S88
³ http://tinyurl.com/MphfCompres
⁴ http://www.freebase.com/schema
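The light filtering step described in Section 2 can be illustrated with a short sketch. It is not the authors' implementation, since the paper defers the details to Ribeiro et al. (2011). The sketch assumes that sentences are already represented as equal-length numeric vectors, that the centroid is the mean vector, and that the distance from a sentence to the support set is its minimum Euclidean distance to any support sentence. The function names `euclidean` and `light_filter` are hypothetical.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def light_filter(vectors, k=5, drop_fraction=0.10):
    """Return the indices of the sentences kept after light filtering.

    Assumptions (not from the paper): the centroid is the mean vector, and
    the distance to the support set is the minimum distance to any member.
    """
    n = len(vectors)
    if n == 0:
        return []
    dims = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / n for i in range(dims)]

    # Support set: the k sentences closest to the centroid.
    by_centrality = sorted(range(n), key=lambda i: euclidean(vectors[i], centroid))
    support = by_centrality[:k]

    # Relevance score: distance from each sentence to the support set.
    dist_to_support = [
        min(euclidean(vectors[i], vectors[s]) for s in support) for i in range(n)
    ]

    # Drop the most distant fraction of sentences, keeping document order.
    n_drop = int(n * drop_fraction)
    most_distant = sorted(range(n), key=lambda i: dist_to_support[i], reverse=True)
    dropped = set(most_distant[:n_drop])
    return [i for i in range(n) if i not in dropped]
```

With the defaults (5 support sentences, 10% removed), a ten-sentence document loses the single sentence that lies farthest from its support set.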
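The TF-IDF definition given in Section 3 can be written as a few lines of Python. This is a minimal sketch under stated assumptions: each document is a list of tokens, the term is a single token (multi-word phrases would need phrase matching), and the logarithm is natural, since the paper does not state the base. The function names are illustrative only.

```python
import math


def tf(term, document):
    """Number of occurrences of `term` in a tokenized document."""
    return document.count(term)


def idf(term, corpus):
    """log(|D| / (1 + number of documents that contain the term))."""
    containing = sum(1 for document in corpus if term in document)
    return math.log(len(corpus) / (1 + containing))


def tf_idf(term, document, corpus):
    """tf-idf(t, d) = tf(t, d) * idf(t, D)."""
    return tf(term, document) * idf(term, corpus)


# Toy corpus of four tokenized documents (made-up data for illustration).
corpus = [["a", "b", "a"], ["b", "c"], ["c", "d"], ["d", "e"]]
score = tf_idf("a", corpus[0], corpus)  # tf = 2, idf = log(4 / 2)
```

Because of the 1 added in the denominator, a term that appears in every document receives a slightly negative idf under this definition.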
