Synchronizing read-aloud audio with text is a powerful reinforcement for language learners at all levels. To provide this kind of synchronized media experience, the audio must be aligned with the text so that the correct audio plays while the related text is presented or highlighted. One way to align text and audio is a manual process using an audio editor, but this is time-consuming, expensive, and error-prone. A much faster and less expensive alternative is automatic alignment using speech recognition. Because the text and the matching audio are known ahead of time, the speech recognizer can perform this task with a very low error rate. Accuracy is further improved by the fact that read-aloud stories are typically recorded as careful speech at a lower words-per-minute rate than conversational speech. In Colibro Publishing’s approach, a Speech Recognition Grammar Specification (SRGS) grammar is generated from the text and supplied to a speech recognizer, which then produces Extensible Multimodal Annotation (EMMA) output containing the exact audio timestamps for the beginning and end of each sentence. The resulting alignment is used in the interactive story production process so that the correct audio is played with the highlighted text.
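The pipeline the abstract describes can be sketched roughly as follows. This is a minimal illustration, not Colibro's actual implementation: the grammar structure follows the W3C SRGS 1.0 XML form, and the `emma:tokens`, `emma:start`, and `emma:end` attributes come from the W3C EMMA 1.0 specification, but the function names, the one-item-per-sentence grammar layout, and the sample recognizer output are assumptions made for the sketch.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"

def sentences_to_srgs(sentences):
    """Build a minimal SRGS 1.0 XML grammar listing each sentence as one item."""
    grammar = ET.Element("grammar", {
        "xmlns": "http://www.w3.org/2001/06/grammar",
        "version": "1.0",
        "xml:lang": "en-US",
        "root": "story",
    })
    rule = ET.SubElement(grammar, "rule", {"id": "story"})
    for sentence in sentences:
        item = ET.SubElement(rule, "item")
        item.text = sentence
    return ET.tostring(grammar, encoding="unicode")

def extract_alignment(emma_xml):
    """Pull (tokens, start_ms, end_ms) tuples from an EMMA result document."""
    root = ET.fromstring(emma_xml)
    spans = []
    for interp in root.iter(f"{{{EMMA_NS}}}interpretation"):
        spans.append((
            interp.get(f"{{{EMMA_NS}}}tokens"),
            int(interp.get(f"{{{EMMA_NS}}}start")),
            int(interp.get(f"{{{EMMA_NS}}}end")),
        ))
    return spans

# Generate a grammar from the known story text ...
srgs = sentences_to_srgs(["Once upon a time, there was a fox."])

# ... and, after recognition, read timestamps back from a (hypothetical)
# EMMA result such as this one:
emma_result = """<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="s1"
      emma:tokens="once upon a time there was a fox"
      emma:start="0" emma:end="2750"/>
</emma:emma>"""

print(extract_alignment(emma_result))
# [('once upon a time there was a fox', 0, 2750)]
```

The extracted start and end times (here in milliseconds, one span per sentence) are what an interactive-story player would use to trigger audio playback as each sentence is highlighted.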
Dahl, D. A., & Dooner, B. (2016). A case study of audio alignment for multimedia language learning: Applications of SRGS and EMMA in Colibro Publishing. In Multimodal Interaction with W3C Standards: Toward Natural User Interfaces to Everything (pp. 311–321). Springer International Publishing. https://doi.org/10.1007/978-3-319-42816-1_14