Sequence Models for Document Structure Identification in an Undeciphered Script

Logan Born; M. Willis Monroe; Kathryn Kelley; Anoop Sarkar

Conference Proceedings

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 (2022) 9111-9121

DOI: 10.18653/v1/2022.emnlp-main.620

Abstract

This work describes the first thorough analysis of “header” signs in proto-Elamite, an undeciphered script from 3100-2900 BCE. Headers are a category of signs which have been provisionally identified through painstaking manual analysis of this script by domain experts. We use unsupervised neural and statistical sequence modeling techniques to provide new and independent evidence for the existence of headers, without supervision from domain experts. Having affirmed the existence of headers as a legitimate structural feature, we next arrive at a richer understanding of their possible meaning and purpose by (i) examining which features predict their presence; (ii) identifying correlations between these features and other document properties; and (iii) examining cases where these features predict the presence of a header in texts where domain experts do not expect one (or vice versa). We provide more concrete processes for labeling headers in this corpus and a clearer justification for existing intuitions about document structure in proto-Elamite.

Citation (APA)

Born, L., Monroe, M. W., Kelley, K., & Sarkar, A. (2022). Sequence Models for Document Structure Identification in an Undeciphered Script. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 (pp. 9111–9121). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.emnlp-main.620
