A Self-Transcribing Speech Corpus: Collecting Continuous Speech with an Online Educational Game

Alexander Gruenstein; Ian McGraw; Andrew Sutherland

Conference Proceedings

Speech and Language Technology in Education, SLaTE 2009 (2009) 109-112

DOI: 10.21437/slate.2009-24

Abstract

We describe a novel approach to collecting orthographically transcribed continuous speech data through the use of an online educational game called Voice Scatter, in which players study flashcards by using speech to match terms with their definitions. We analyze a corpus of 30,938 utterances, totaling 27.63 hours of speech, collected during the first 22 days that Voice Scatter was publicly available. Though each individual game covers only a small vocabulary, in aggregate, the speech recognition hypotheses in the corpus contain 21,758 distinct words. We show that Amazon Mechanical Turk can be used to orthographically transcribe utterances in the corpus quickly and cheaply, with near-expert accuracy. Moreover, we present a filtering technique that automatically identifies a sub-corpus of 39% of the data for which recognition hypotheses can be considered human-quality transcripts. We demonstrate the usefulness of such self-transcribed data for acoustic model adaptation.

Citation (APA)

Gruenstein, A., McGraw, I., & Sutherland, A. (2009). A Self-Transcribing Speech Corpus: Collecting Continuous Speech with an Online Educational Game. In Speech and Language Technology in Education, SLaTE 2009 (pp. 109–112). International Speech Communication Association (ISCA). https://doi.org/10.21437/slate.2009-24