Sequence Models for Document Structure Identification in an Undeciphered Script

Abstract

This work describes the first thorough analysis of “header” signs in proto-Elamite, an undeciphered script from 3100–2900 BCE. Headers are a category of signs which have been provisionally identified through painstaking manual analysis of this script by domain experts. We use unsupervised neural and statistical sequence modeling techniques to provide new and independent evidence for the existence of headers, without supervision from domain experts. Having affirmed the existence of headers as a legitimate structural feature, we next arrive at a richer understanding of their possible meaning and purpose by (i) examining which features predict their presence; (ii) identifying correlations between these features and other document properties; and (iii) examining cases where these features predict the presence of a header in texts where domain experts do not expect one (or vice versa). We provide more concrete processes for labeling headers in this corpus and a clearer justification for existing intuitions about document structure in proto-Elamite.

Cite

Born, L., Monroe, M. W., Kelley, K., & Sarkar, A. (2022). Sequence Models for Document Structure Identification in an Undeciphered Script. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 (pp. 9111–9121). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.emnlp-main.620