Older legal texts are often digitized by scanning and Optical Character Recognition (OCR), which introduces numerous errors. Although spelling and grammar checkers can correct much of the scanned text automatically, Named Entity Recognition (NER) on such text is challenging, which makes names difficult to correct. To address this, we developed an ensemble language model that combines a transformer neural network architecture with a finite state machine to extract names from English-language legal text. We train and test on the U.S., English-language Harvard Caselaw Access Project. The extracted names are then subjected to heuristic textual analysis to identify errors, make corrections, and quantify the extent of the problems. With this system, we can extract most names, automatically correct numerous errors, and identify potential mistakes that can later be reviewed for manual correction.
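To make the heuristic textual analysis step concrete, the sketch below shows one plausible heuristic. The abstract does not specify the actual rules, so this is an assumption and not the authors' method. It treats spellings that occur often as trusted and flags rare spellings that closely resemble a trusted one as likely OCR errors. The function name `suggest_corrections`, the thresholds `min_ratio` and `min_support`, and the sample names are all hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def suggest_corrections(names, min_ratio=0.8, min_support=3):
    """Map rare name spellings to a frequent, closely similar spelling.

    Illustrative heuristic only; the thresholds are assumed values.
    """
    counts = Counter(names)
    # Spellings seen at least min_support times serve as reference forms.
    trusted = [n for n, c in counts.items() if c >= min_support]
    suggestions = {}
    for name, count in counts.items():
        if count >= min_support:
            continue  # frequent spellings are not candidates for correction
        best, best_ratio = None, min_ratio
        for candidate in trusted:
            # Character-level similarity in [0, 1], case-insensitive.
            ratio = SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
            if ratio >= best_ratio:
                best, best_ratio = candidate, ratio
        if best is not None:
            suggestions[name] = best
    return suggestions


# Made-up extraction output: two frequent names, two OCR-like variants,
# and one rare name that resembles nothing else.
extracted = ["Holmes"] * 5 + ["Brandeis"] * 4 + ["Hoimes", "Brandcis", "Taft"]
corrections = suggest_corrections(extracted)
print(corrections)
```

Here the two variants are mapped to their frequent counterparts, while the rare but dissimilar name is left alone. In practice, suggestions near the threshold would be candidates for the manual review mentioned above, not for automatic replacement.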
CITATION STYLE
Trias, F., Wang, H., Jaume, S., & Idreos, S. (2021). Named Entity Recognition in Historic Legal Text: A Transformer and State Machine Ensemble Method. In Natural Legal Language Processing, NLLP 2021 - Proceedings of the 2021 Workshop (pp. 172–179). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.nllp-1.18