Abstract
A standard text-as-data workflow in the social sciences involves identifying a set of documents to be labeled, selecting a random sample of them to label using research assistants, training a supervised learner to label the remaining documents, and validating that model's performance using standard accuracy metrics. The most resource-intensive component of this workflow is the hand-labeling: carefully reading documents, training research assistants, and paying human coders to label documents in duplicate or more. We show that hand-coding an algorithmically selected rather than a simple-random sample can improve model performance above baseline by as much as 50%, or reduce hand-coding costs by up to two-thirds, in applications predicting (1) U.S. executive-order significance and (2) financial sentiment on social media. We accompany this manuscript with open-source software to implement these tools, which we hope can make supervised learning cheaper and more accessible to researchers.
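The abstract does not specify which selection algorithm the paper uses, but a common active-learning approach to choosing a more informative training set is uncertainty sampling: rather than labeling a simple random sample, hand-label the documents the current model is least certain about. The sketch below is a minimal, hypothetical illustration of that idea (the `select_batch` function, the toy keyword-based model, and the documents are all invented for demonstration and are not from the paper or its accompanying software).

```python
# Hypothetical sketch of uncertainty sampling for training-set selection:
# label the documents whose predicted probability is nearest the model's
# decision boundary, instead of a simple random sample.
import random


def select_batch(pool, predict_proba, k, strategy="uncertainty"):
    """Return indices of the k documents to hand-label next.

    pool          : list of unlabeled documents
    predict_proba : callable mapping a document to P(class = 1)
    strategy      : "uncertainty" or "random" (the baseline)
    """
    if strategy == "random":
        return random.sample(range(len(pool)), k)
    # Uncertainty sampling: rank documents by how close their predicted
    # probability is to 0.5, i.e., how unsure the current model is.
    ranked = sorted(range(len(pool)),
                    key=lambda i: abs(predict_proba(pool[i]) - 0.5))
    return ranked[:k]


# Toy "model": probability is the fraction of tokens equal to "good".
docs = ["good good", "bad bad", "good bad", "good good good bad"]
proba = lambda d: d.split().count("good") / len(d.split())

print(select_batch(docs, proba, 2))  # → [2, 3], the two most ambiguous docs
```

In a real workflow, `predict_proba` would come from a supervised learner refit after each labeling round, so each new batch targets whatever the model currently finds hardest.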
Kaufman, A. R. (2024). Selecting More Informative Training Sets with Fewer Observations. Political Analysis, 32(1), 133–139. https://doi.org/10.1017/pan.2023.19