The Denglisch Corpus of German-English Code-Switching


Abstract

When multilingual speakers engage in conversation, they inevitably introduce code-switching (CS), i.e., the mixing of more than one language between and within utterances. CS remains an understudied phenomenon, especially in the written medium, and relatively few computational resources for studying it are available. We describe a corpus of German-English code-switching in social media interactions. We focus on some of the challenges in annotating CS, especially those posed by words whose language ID cannot be easily determined. We introduce a novel schema for such word-level annotation, with which we manually annotated a subset of the corpus. We then trained classifiers to predict and identify switches, and applied them to the remainder of the corpus. We thereby created a large-scale corpus of German-English mixed utterances with precise indications of CS points.
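
To make the notion of a CS point concrete, here is a minimal sketch of how switch positions could be read off word-level language tags. The tag names ("de", "en", "other"), the function name, and the example sentence are assumptions made for illustration; they are not the annotation schema or the classifiers described in the paper.

```python
from typing import List

def switch_points(tags: List[str], neutral: str = "other") -> List[int]:
    """Return indices of tokens whose language tag differs from that of
    the previous language-bearing token.

    Tokens carrying the (hypothetical) neutral tag, e.g. punctuation,
    are skipped and never count as a switch.
    """
    points = []
    prev = None  # language of the last non-neutral token seen
    for i, tag in enumerate(tags):
        if tag == neutral:
            continue
        if prev is not None and tag != prev:
            points.append(i)
        prev = tag
    return points

# Illustrative utterance with hand-assigned, hypothetical tags.
tokens = ["Das", "ist", "so", "nice", ",", "thank", "you"]
tags = ["de", "de", "de", "en", "other", "en", "en"]

for i in switch_points(tags):
    print(f"switch at token {i}: {tokens[i]!r}")
```

Skipping neutral tokens is one possible design choice: it keeps punctuation from splitting a single-language stretch into two. Words whose language cannot be easily determined, which the abstract singles out as the main annotation challenge, would need a richer tag set than this sketch assumes.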

Cite

Osmelak, D., & Wintner, S. (2023). The Denglisch Corpus of German-English Code-Switching. In SIGTYP 2023 - 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, Proceedings of the Workshop (pp. 42–51). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.sigtyp-1.5
