Long-tail vocabulary dictionary extraction from the web

Zhe Chen; Michael Cafarella; H. V. Jagadish

Conference Proceedings

WSDM 2016 - Proceedings of the 9th ACM International Conference on Web Search and Data Mining (2016) 625-634

DOI: 10.1145/2835776.2835778

36 Citations

34 Readers

Abstract

A dictionary - a set of instances belonging to the same conceptual class - is central to information extraction and is a useful primitive for many applications, including query log analysis and document categorization. Considerable work has focused on generating accurate dictionaries given a few example seeds, but methods to date cannot obtain long-tail (rare) items with high accuracy and recall. In this paper, we develop a novel method to construct high-quality dictionaries, especially for long-tail vocabularies, using just a few user-provided seeds for each topic. Our algorithm obtains long-tail items by building and executing high-quality webpage-specific extractors. We use webpage-specific structural and textual information to build more accurate per-page extractors in order to detect the long-tail items from a single webpage. These webpage-specific extractors are obtained via a co-training procedure using distantly-supervised training data. By aggregating the page-specific dictionaries of many webpages, our system, Lyretail, is able to output a high-quality comprehensive dictionary. Our experiments demonstrate that in long-tail vocabulary settings, we obtain a 17.3% improvement on mean average precision for the dictionary generation process, and a 30.7% improvement on F1 for the page-specific extraction, when compared to previous state-of-the-art methods.
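As a rough illustration of the pipeline and the two metrics named in the abstract, the sketch below merges several page-specific dictionaries into one ranked dictionary and then scores it. The aggregation rule used here (counting how many pages returned each candidate) is an assumption made for illustration only; the abstract does not state how Lyretail aggregates, and all function names are hypothetical. Average precision and F1 follow their standard definitions.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Set, Tuple


def aggregate_page_dictionaries(page_dicts: Iterable[Set[str]]) -> List[str]:
    """Rank candidates by the number of pages whose extractor returned them.

    Assumption: plain page-count voting, not the paper's aggregation rule.
    Ties are broken alphabetically so the ranking is deterministic.
    """
    votes = Counter(item for page in page_dicts for item in page)
    return sorted(votes, key=lambda item: (-votes[item], item))


def average_precision(ranked: Sequence[str], gold: Set[str]) -> float:
    """Sum of precision@k at each rank k holding a gold item, divided by |gold|."""
    if not gold:
        return 0.0
    hits, total = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in gold:
            hits += 1
            total += hits / k
    return total / len(gold)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of average precision over (ranked list, gold set) pairs, one per topic."""
    scores = [average_precision(ranked, gold) for ranked, gold in runs]
    return sum(scores) / len(scores) if scores else 0.0


def f1_score(predicted: Set[str], gold: Set[str]) -> float:
    """Harmonic mean of precision and recall for one page's extracted set."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy page-specific dictionaries with made-up items, not data from the paper.
    pages = [{"a", "b"}, {"b", "c"}, {"b", "x"}, {"a"}]
    ranked = aggregate_page_dictionaries(pages)
    print(ranked)                                  # ['b', 'a', 'c', 'x']
    print(average_precision(ranked, {"b", "c"}))   # (1/1 + 2/3) / 2
    print(f1_score({"a", "b"}, {"b", "c"}))        # 0.5
```

Mean average precision suits the dictionary-generation step because the output is a ranked list, while F1 suits page-specific extraction, where each page yields an unranked set of items.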

Cite

Chen, Z., Cafarella, M., & Jagadish, H. V. (2016). Long-tail vocabulary dictionary extraction from the web. In WSDM 2016 - Proceedings of the 9th ACM International Conference on Web Search and Data Mining (pp. 625–634). Association for Computing Machinery, Inc. https://doi.org/10.1145/2835776.2835778
