Feature Extraction for Large-Scale Text Collections

1Citations
Citations of this article
12Readers
Mendeley users who have this article in their library.
Get full text

Abstract

Feature engineering is a fundamental but poorly documented component in Learning-to-Rank (LTR) search engines. Such features are commonly used to construct learning models for web and product search engines, recommender systems, and question-answering tasks. In each of these domains, there is a growing interest in the creation of open-access test collections that promote reproducible research. However, there are still few open-source software packages capable of extracting high-quality machine learning features from large text collections. Instead, most feature-based LTR research relies on "canned" test collections, which often do not expose critical details about the underlying collection or implementation details of the extracted features. Both of these are crucial to collection creation and deployment of a search engine into production. So in this regard, the experiments are rarely reproducible with new features or collections, or helpful for companies wishing to deploy LTR systems. In this paper, we introduce Fxt, an open-source framework to perform efficient and scalable feature extraction. Fxt can easily be integrated into complex, high-performance software applications to help solve a wide variety of text-based machine learning problems. To demonstrate the software's utility, we build and document a reproducible feature extraction pipeline and show how to recreate several common LTR experiments using the ClueWeb09B collection. Researchers and practitioners can benefit from Fxt to extend their machine learning pipelines for various text-based retrieval tasks, and learn how some static document features and query-specific features are implemented.

Cite

CITATION STYLE

APA

Gallagher, L., Mallia, A., Culpepper, J. S., Suel, T., & Cambazoglu, B. B. (2020). Feature Extraction for Large-Scale Text Collections. In International Conference on Information and Knowledge Management, Proceedings (pp. 3015–3022). Association for Computing Machinery. https://doi.org/10.1145/3340531.3412773

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free