Automatic construction of a large-scale speech recognition database using multi-genre broadcast data with inaccurate subtitle timestamps

Jeong Uk Bang; Mu Yeol Choi; Sang Hun Kim; Oh Wook Kwon

Journal Article, Open Access

IEICE Transactions on Information and Systems (2020) E103D(2) 406-415

DOI: 10.1587/transinf.2019EDP7234

14 Citations

6 Readers

Abstract

As deep learning-based speech recognition systems attract growing attention, the need for large-scale speech databases for acoustic model training is increasing. Broadcast data is a convenient source for database construction because it includes transcripts provided for the hearing impaired. However, subtitle timestamps have not been used to extract speech data because they are often inaccurate due to the inherent characteristics of closed captioning. We therefore propose a method for building a large-scale speech database from multi-genre broadcast data with inaccurate subtitle timestamps. The proposed method first extracts the most likely speech intervals by removing subtitle texts with a low subtitle quality index, concatenating adjacent subtitle texts into a merged subtitle text, and adding a margin to the timestamps of the merged subtitle text. Next, a speech recognizer produces a hypothesis text for the speech segment corresponding to the merged subtitle text, and the hypothesis text obtained from the decoder is recursively aligned with the merged subtitle text. Finally, the speech database is constructed by selecting the sub-parts of the merged subtitle text that match the hypothesis text. Our method successfully refines a large amount of broadcast data with inaccurate subtitle timestamps in about half the time required by previous methods. Consequently, it is useful for broadcast data processing, where bulk speech data can be collected every hour.
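The pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Subtitle` record, the `quality` field standing in for the subtitle quality index, the thresholds `min_quality`, `max_gap`, and `margin`, and both function names are assumptions introduced for illustration. The alignment step uses the standard library's `difflib.SequenceMatcher` in a single pass as a stand-in for the recursive alignment the paper describes, and the hypothesis text is passed in directly rather than produced by a speech recognizer.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Subtitle:
    start: float    # start time in seconds
    end: float      # end time in seconds
    text: str
    quality: float  # hypothetical quality score in [0, 1]


def merge_subtitles(subs, min_quality=0.5, max_gap=1.0, margin=2.0):
    """Drop low-quality subtitles, merge close neighbours, pad the timestamps."""
    kept = [s for s in subs if s.quality >= min_quality]
    merged = []
    for s in sorted(kept, key=lambda s: s.start):
        if merged and s.start - merged[-1].end <= max_gap:
            # Close enough to the previous subtitle: concatenate the two.
            last = merged[-1]
            merged[-1] = Subtitle(last.start, max(last.end, s.end),
                                  last.text + " " + s.text,
                                  min(last.quality, s.quality))
        else:
            merged.append(Subtitle(s.start, s.end, s.text, s.quality))
    # Widen each merged interval so inaccurate timestamps still cover the speech.
    return [Subtitle(max(0.0, m.start - margin), m.end + margin, m.text, m.quality)
            for m in merged]


def matching_subparts(subtitle_text, hypothesis_text, min_words=2):
    """Return word runs of the subtitle that also occur, in order, in the hypothesis."""
    ref = subtitle_text.split()
    hyp = hypothesis_text.split()
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    return [" ".join(ref[block.a:block.a + block.size])
            for block in matcher.get_matching_blocks()
            if block.size >= min_words]


if __name__ == "__main__":
    subs = [
        Subtitle(10.0, 12.0, "hello there", 0.9),
        Subtitle(12.5, 14.0, "how are you", 0.8),
        Subtitle(14.2, 15.0, "noise", 0.1),
        Subtitle(30.0, 32.0, "good evening", 0.9),
    ]
    for m in merge_subtitles(subs):
        print(m.start, m.end, m.text)
    print(matching_subparts("hello there how are you today",
                            "uh hello there how is you today"))
```

Only the word runs returned by `matching_subparts` would be kept as training transcripts in this sketch; subtitle words that the recognizer output does not confirm are discarded.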

Bang, J. U., Choi, M. Y., Kim, S. H., & Kwon, O. W. (2020). Automatic construction of a large-scale speech recognition database using multi-genre broadcast data with inaccurate subtitle timestamps. IEICE Transactions on Information and Systems, E103D(2), 406–415. https://doi.org/10.1587/transinf.2019EDP7234
