Selecting More Informative Training Sets with Fewer Observations


Abstract

A standard text-as-data workflow in the social sciences involves identifying a set of documents to be labeled, selecting a random sample of them to label using research assistants, training a supervised learner to label the remaining documents, and validating that model's performance using standard accuracy metrics. The most resource-intensive component of this is the hand-labeling: carefully reading documents, training research assistants, and paying human coders to label documents in duplicate or more. We show that hand-coding an algorithmically selected rather than a simple-random sample can improve model performance above baseline by as much as 50%, or reduce hand-coding costs by up to two-thirds, in applications predicting (1) U.S. executive-order significance and (2) financial sentiment on social media. We accompany this manuscript with open-source software to implement these tools, which we hope can make supervised learning cheaper and more accessible to researchers.
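The core idea of "algorithmically selected" training sets can be illustrated with uncertainty-based active sampling: instead of labeling a simple-random sample, iteratively label the documents the current model is least confident about. The sketch below is a hypothetical illustration using scikit-learn and synthetic data, not the authors' implementation or their accompanying software; the seed size, budget, and least-confidence criterion are all assumptions.

```python
# Hypothetical sketch: uncertainty sampling to choose which documents
# to hand-label next, rather than drawing a simple-random sample.
# Setup and parameters are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in for a document-feature matrix and (mostly unobserved) labels.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

labeled = list(range(20))          # small seed of hand-coded documents
unlabeled = list(range(20, 500))   # pool awaiting labels

for _ in range(30):                # budget: 30 additional hand-labels
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    probs = model.predict_proba(X[unlabeled])
    # Least-confidence criterion: pick the document whose top class
    # probability is smallest, i.e. the one the model is most unsure about.
    idx = int(np.argmin(probs.max(axis=1)))
    labeled.append(unlabeled.pop(idx))

print(len(labeled))  # 50 documents labeled in total
```

In each round, the "label" step stands in for sending one document to human coders; in a real workflow, the newly coded label replaces the lookup into `y`.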


Citation (APA)

Kaufman, A. R. (2024). Selecting More Informative Training Sets with Fewer Observations. Political Analysis, 32(1), 133–139. https://doi.org/10.1017/pan.2023.19
