Long-tail vocabulary dictionary extraction from the web

Abstract

A dictionary, a set of instances belonging to the same conceptual class, is central to information extraction and is a useful primitive for many applications, including query log analysis and document categorization. Considerable work has focused on generating accurate dictionaries given a few example seeds, but methods to date cannot obtain long-tail (rare) items with high accuracy and recall. In this paper, we develop a novel method to construct high-quality dictionaries, especially for long-tail vocabularies, using just a few user-provided seeds for each topic. Our algorithm obtains long-tail items by building and executing high-quality webpage-specific extractors. These extractors use each webpage's structural and textual information to detect long-tail items on that single page more accurately, and they are obtained via a co-training procedure using distantly-supervised training data. By aggregating the page-specific dictionaries of many webpages, our system, Lyretail, is able to output a high-quality comprehensive dictionary. Our experiments demonstrate that in long-tail vocabulary settings, Lyretail obtains a 17.3% improvement in mean average precision for the dictionary generation process, and a 30.7% improvement in F1 for the page-specific extraction, when compared to previous state-of-the-art methods.
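The aggregation step described above can be illustrated with a minimal sketch. The abstract does not specify how page-specific dictionaries are combined, so the scoring rule here (count the pages that extracted each item), the function name `aggregate_page_dictionaries`, and the `min_pages` threshold are all assumptions made for illustration, not the paper's method.

```python
from collections import Counter


def aggregate_page_dictionaries(page_dictionaries, seeds, min_pages=2):
    """Rank candidate items by how many pages extracted them.

    Hypothetical scoring rule for illustration only; the abstract does
    not describe the actual aggregation used by the system.
    """
    support = Counter()
    for items in page_dictionaries:
        # Normalize and count each item at most once per page.
        support.update({item.strip().lower() for item in items})

    seed_set = {seed.strip().lower() for seed in seeds}
    ranked = [
        (item, count)
        for item, count in support.items()
        # Keep items with enough page support; keep extracted seeds regardless.
        if count >= min_pages or item in seed_set
    ]
    # Sort by descending support, then alphabetically for a stable order.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked


# Toy input: the output of three hypothetical page-specific extractors.
pages = [
    ["Apple", "Quince"],
    ["apple", "quince", "Medlar"],
    ["Apple", "medlar", "Login"],
]
result = aggregate_page_dictionaries(pages, seeds=["apple"])
```

In this toy input, a rare item such as "medlar" survives because two pages agree on it, while "Login", extracted from only one page, is dropped. A real system would likely weight pages by extractor confidence rather than counting them equally.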

Cite

Chen, Z., Cafarella, M., & Jagadish, H. V. (2016). Long-tail vocabulary dictionary extraction from the web. In WSDM 2016 - Proceedings of the 9th ACM International Conference on Web Search and Data Mining (pp. 625–634). Association for Computing Machinery, Inc. https://doi.org/10.1145/2835776.2835778
