HATHI 1M: Introducing a Million Page Historical Prose Dataset in English from the Hathi Trust

Sunyam Bagga; Andrew Piper

Journal Article, Open Access

Journal of Open Humanities Data (2022) 8

DOI: 10.5334/johd.71

Abstract

We present a new dataset, built on prior work, consisting of 1,671,370 randomly sampled pages of English-language prose, roughly divided between fictional and non-fictional modes of writing and published between 1800 and 2000. In addition to focusing on the “page” as the basic bibliographic unit, our work, in contrast to prior work, employs a single predictive model for the entire historical period under consideration. Besides publication metadata, we also provide an enriched set of 107 features, including part-of-speech tags, sentiment scores, word supersenses, and more. Our data is designed to give researchers in the digital humanities large yet portable random samples of historical writing across two foundational modes of English prose. Using our enriched features, we present initial insights into how linguistic patterns changed across this historical period, as possible pointers to future work. The data can be accessed at https://doi.org/10.7910/DVN/HAKKUA.

Citation (APA)

Bagga, S., & Piper, A. (2022). HATHI 1M: Introducing a Million Page Historical Prose Dataset in English from the Hathi Trust. Journal of Open Humanities Data, 8. https://doi.org/10.5334/johd.71
