A Self-Transcribing Speech Corpus: Collecting Continuous Speech with an Online Educational Game

Abstract

We describe a novel approach to collecting orthographically transcribed continuous speech data through the use of an online educational game called Voice Scatter, in which players study flashcards by using speech to match terms with their definitions. We analyze a corpus of 30,938 utterances, totaling 27.63 hours of speech, collected during the first 22 days that Voice Scatter was publicly available. Though each individual game covers only a small vocabulary, in aggregate, the speech recognition hypotheses in the corpus contain 21,758 distinct words. We show that Amazon Mechanical Turk can be used to orthographically transcribe utterances in the corpus quickly and cheaply, with near-expert accuracy. Moreover, we present a filtering technique that automatically identifies a sub-corpus of 39% of the data for which recognition hypotheses can be considered human-quality transcripts. We demonstrate the usefulness of such self-transcribed data for acoustic model adaptation.
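For a sense of scale, the figures quoted in the abstract can be turned into a few derived quantities. The short Python sketch below is only back-of-the-envelope arithmetic, not anything taken from the paper. The function name `corpus_summary` is hypothetical, and the sketch assumes that the 39% figure refers to a share of utterances rather than of audio duration, which the abstract does not specify.

```python
# Figures quoted in the abstract.
UTTERANCES = 30_938
HOURS_OF_SPEECH = 27.63
COLLECTION_DAYS = 22
# Assumption: the 39% sub-corpus is measured in utterances, not in audio time.
SELF_TRANSCRIBED_FRACTION = 0.39


def corpus_summary():
    """Return rough derived statistics for the corpus (hypothetical helper)."""
    return {
        # Average length of one utterance, in seconds.
        "mean_utterance_seconds": HOURS_OF_SPEECH * 3600 / UTTERANCES,
        # Average number of utterances collected per day.
        "utterances_per_day": UTTERANCES / COLLECTION_DAYS,
        # Approximate size of the automatically filtered sub-corpus.
        "self_transcribed_utterances": round(UTTERANCES * SELF_TRANSCRIBED_FRACTION),
    }


if __name__ == "__main__":
    for name, value in corpus_summary().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions, the average utterance lasts a little over three seconds, the game gathered roughly 1,400 utterances per day, and the filtered sub-corpus holds about 12,000 utterances.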

Gruenstein, A., McGraw, I., & Sutherland, A. (2009). A Self-Transcribing Speech Corpus: Collecting Continuous Speech with an Online Educational Game. In Speech and Language Technology in Education, SLaTE 2009 (pp. 109–112). International Speech Communication Association (ISCA). https://doi.org/10.21437/slate.2009-24