Abstract
Crowdsourcing is essential for collecting information about real-world entities. Existing crowdsourced data extraction solutions use fixed, non-adaptive querying strategies that repeatedly ask workers to provide entities from a fixed domain until a desired level of coverage is reached. Unfortunately, such solutions are highly impractical because they yield many duplicate extractions. We design an adaptive querying framework, CRUX, that maximizes the number of extracted entities for a given budget. We show that the problem of budgeted crowdsourced entity extraction is NP-hard. We leverage two insights to focus our extraction efforts: exploiting the structure of the domain of interest, and using exclude lists to limit repeated extractions. We develop new statistical tools to estimate how many new distinct entities additional queries will extract when little information is available, and embed them within adaptive algorithms that maximize the number of distinct extracted entities under budget constraints. We evaluate our techniques on synthetic and real-world datasets, demonstrating an improvement of up to 300% over competing approaches for the same budget.
Cite
Rekatsinas, T., Deshpande, A., & Parameswaran, A. (2019). CRUX: Adaptive querying for efficient crowdsourced data extraction. In International Conference on Information and Knowledge Management, Proceedings (pp. 841–850). Association for Computing Machinery. https://doi.org/10.1145/3357384.3357976