The DARPA Spoken Language System (SLS) community has designed, implemented, and globally distributed significant speech corpora that are widely used for advancing speech recognition research. The Wall Street Journal (WSJ) CSR Corpus described here will provide DARPA with its first general-purpose English, large-vocabulary, natural-language, high-perplexity corpus containing significant quantities of both speech data (400 hrs.) and text data (47M words), thereby supplying a means to integrate speech recognition and natural language processing in application domains with high potential practical value. This paper presents the motivating goals, acoustic data design, text processing steps, lexicons, and testing paradigms to be incorporated into the multi-faceted WSJ CSR Corpus. As of this writing, only the WSJ-pilot, or Phase-one, corpus (~80 hrs.) has been implemented.
Paul, D. B., & Baker, J. M. (1992). The Design for the Wall Street Journal-based CSR Corpus. In 2nd International Conference on Spoken Language Processing, ICSLP 1992 (pp. 899–902). International Speech Communication Association (ISCA). https://doi.org/10.3115/1075527.1075614