Abstract
A dictionary, a set of instances belonging to the same conceptual class, is central to information extraction and is a useful primitive for many applications, including query log analysis and document categorization. Considerable work has focused on generating accurate dictionaries given a few example seeds, but methods to date cannot obtain long-tail (rare) items with high accuracy and recall. In this paper, we develop a novel method to construct high-quality dictionaries, especially for long-tail vocabularies, using just a few user-provided seeds for each topic. Our algorithm obtains long-tail items by building and executing high-quality webpage-specific extractors. We use webpage-specific structural and textual information to build more accurate per-page extractors that detect the long-tail items on a single webpage. These webpage-specific extractors are obtained via a co-training procedure using distantly supervised training data. By aggregating the page-specific dictionaries of many webpages, our system, Lyretail, outputs a high-quality, comprehensive dictionary. Our experiments demonstrate that in long-tail vocabulary settings, we obtain a 17.3% improvement in mean average precision for the dictionary generation process and a 30.7% improvement in F1 for page-specific extraction, compared with previous state-of-the-art methods.
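For readers unfamiliar with the headline metric, the following is a minimal sketch of mean average precision over ranked dictionary outputs. It uses the standard textbook definition, which may differ from the exact variant used in the paper (for example, a rank cutoff). The function names and the sample data in the usage note are hypothetical and are not taken from the paper.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against a set of correct items.

    Assumes `ranked` contains no duplicate items.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            # Precision measured at the rank of each correct item.
            precision_sum += hits / rank
    # Normalize by the number of correct items, so missed items lower the score.
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Iterable[Tuple[Sequence[str], Set[str]]]
) -> float:
    """Mean of average precision over (ranked list, correct set) pairs, one per topic."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, g) for r, g in runs) / len(runs)
```

As a usage example, `average_precision(["a", "x", "b"], {"a", "b"})` averages the precision at rank 1 (1/1) and at rank 3 (2/3), giving 5/6. Calling `mean_average_precision` on several topics then averages these per-topic scores.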
Cite
Chen, Z., Cafarella, M., & Jagadish, H. V. (2016). Long-tail vocabulary dictionary extraction from the web. In WSDM 2016 - Proceedings of the 9th ACM International Conference on Web Search and Data Mining (pp. 625–634). Association for Computing Machinery, Inc. https://doi.org/10.1145/2835776.2835778