Abstract
A standard text-as-data workflow in the social sciences involves identifying a set of documents to be labeled, selecting a random sample of them to label using research assistants, training a supervised learner to label the remaining documents, and validating that model's performance using standard accuracy metrics. The most resource-intensive component of this workflow is the hand-labeling: carefully reading documents, training research assistants, and paying human coders to label documents in duplicate or more. We show that hand-coding an algorithmically selected rather than a simple-random sample can improve model performance above baseline by as much as 50%, or reduce hand-coding costs by up to two-thirds, in applications predicting (1) U.S. executive-order significance and (2) financial sentiment on social media. We accompany this manuscript with open-source software to implement these tools, which we hope can make supervised learning cheaper and more accessible to researchers.
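The abstract does not specify which selection algorithm the paper uses, but a common active-learning approach to choosing a more informative training set is uncertainty sampling: rather than labeling a simple random sample, hand-label the documents the current model is least certain about. The sketch below is a minimal, hypothetical illustration of that idea (the `select_batch` function, the toy keyword-based model, and the documents are all invented for demonstration and are not from the paper or its accompanying software).

```python
# Hypothetical sketch of uncertainty sampling for training-set selection:
# label the documents whose predicted probability is nearest the model's
# decision boundary, instead of a simple random sample.
import random


def select_batch(pool, predict_proba, k, strategy="uncertainty"):
    """Return indices of the k documents to hand-label next.

    pool          : list of unlabeled documents
    predict_proba : callable mapping a document to P(class = 1)
    strategy      : "uncertainty" or "random" (the baseline)
    """
    if strategy == "random":
        return random.sample(range(len(pool)), k)
    # Uncertainty sampling: rank documents by how close their predicted
    # probability is to 0.5, i.e., how unsure the current model is.
    ranked = sorted(range(len(pool)),
                    key=lambda i: abs(predict_proba(pool[i]) - 0.5))
    return ranked[:k]


# Toy "model": probability is the fraction of tokens equal to "good".
docs = ["good good", "bad bad", "good bad", "good good good bad"]
proba = lambda d: d.split().count("good") / len(d.split())

print(select_batch(docs, proba, 2))  # → [2, 3], the two most ambiguous docs
```

In a real workflow, `predict_proba` would come from a supervised learner refit after each labeling round, so each new batch targets whatever the model currently finds hardest.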
Kaufman, A. R. (2024). Selecting More Informative Training Sets with Fewer Observations. Political Analysis, 32(1), 133–139. https://doi.org/10.1017/pan.2023.19