In this paper, we present the benefits of using crowdsourcing to build speech and language resources for a range of annotation tasks, taking dialectal Arabic as an example of a resource-poor language. We offer recommendations for job design and quality control that allow us to build high-quality data for a variety of tasks. Most of these recommendations are language-independent and can be applied to other languages as well. We summarize lessons learned from experiments in data acquisition tasks: image annotation (transcription of Arabic historical documents), machine translation (translation from English to Hindi), speech annotation (transcription of dialectal Arabic audio files), text annotation (conversion from dialectal Arabic to Modern Standard Arabic (MSA)), and text classification (annotation of offensive language on Arabic social media, and classification of questions on Arabic medical web forums).
Mubarak, H. (2018). Crowdsourcing speech and language data for resource-poor languages. In Advances in Intelligent Systems and Computing (Vol. 639, pp. 440–447). Springer Verlag. https://doi.org/10.1007/978-3-319-64861-3_41