In this paper, we present a segmentation system for German texts. We apply conditional random fields (CRF), a statistical sequential model, to a type of text used in private communication. We show that segmenting individual punctuation marks, taking freestanding lines into account, and using unsupervised word representations (i.e., Brown clustering, Word2Vec and fastText) yields a label accuracy of 96% on a corpus of postcards used in private communication.
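The feature ideas named above can be illustrated with a minimal sketch of per-token feature extraction, the kind of input a CRF sequence labeller typically consumes. This is not the paper's implementation: the function names, the feature names, the rule that a line without closing punctuation counts as freestanding, and the cluster lookup table are all assumptions made for illustration.

```python
import re

# Words stay whole; every punctuation character becomes its own token.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(line):
    """Split a line into word tokens and individual punctuation marks."""
    return TOKEN_RE.findall(line)


def is_punct(token):
    """True if the token is a single punctuation character."""
    return re.fullmatch(r"[^\w\s]", token) is not None


def line_features(lines, clusters=None):
    """Return one feature dict per token, in reading order.

    `clusters` is a hypothetical mapping from lowercased word to a
    cluster ID (e.g. a Brown cluster bit string); unknown words get "UNK".
    """
    clusters = clusters or {}
    rows = []
    for line in lines:
        tokens = tokenize(line)
        if not tokens:
            continue
        # Assumed heuristic: a line that does not end in punctuation
        # is treated as a freestanding line.
        freestanding = not is_punct(tokens[-1])
        for i, tok in enumerate(tokens):
            rows.append({
                "token": tok,
                "lower": tok.lower(),
                "is_punct": is_punct(tok),
                "line_initial": i == 0,
                "line_final": i == len(tokens) - 1,
                "freestanding_line": freestanding,
                "cluster": clusters.get(tok.lower(), "UNK"),
            })
    return rows


# Invented postcard text, used only to exercise the functions.
card = ["Liebe Anna,", "viele Grüße aus Zürich!!", "Deine Kathrin"]
rows = line_features(card, clusters={"grüße": "0110"})
```

Each dictionary in `rows` could be passed to a CRF toolkit as the observation for one token, with the segmentation labels supplied separately. Note that the repeated exclamation marks are kept as two separate tokens, so the model can label each one on its own.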
CITATION STYLE
Sugisaki, K. (2018). Word and sentence segmentation in German: overcoming idiosyncrasies in the use of punctuation in private communication. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10713 LNAI, pp. 62–71). Springer Verlag. https://doi.org/10.1007/978-3-319-73706-5_6