Automatic Speech Recognition (ASR) models can aid field linguists by facilitating the creation of text corpora from oral material. Training ASR systems for low-resourced languages is challenging, not only because resources are scarce but also because preparing a training dataset requires considerable work. We present a pipeline for data processing and ASR model training for low-resourced languages, based on the language family. As a case study, we collected recordings of Pomak, an endangered South East Slavic language variety spoken in Greece. Using the proposed pipeline, we trained the first Pomak ASR model.
Tsoukala, C., Kritsis, K., Douros, I., Katsamanis, A., Kokkas, N., Arampatzakis, V., … Pavlidis, G. (2023). ASR pipeline for low-resourced languages: A case study on Pomak. In FieldMatters 2023 - 2nd Workshop on NLP Applications to Field Linguistics, Proceedings (pp. 30–39). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.fieldmatters-1.5