Text-to-speech synthesis

Yoshinori Shiga; Jinfu Ni; Kentaro Tachibana; Takuma Okamoto

Book Chapter

Springer (2020), pp. 39-52

DOI: 10.1007/978-981-15-0595-9_3

13 Citations

25 Readers

Abstract

The recent progress of text-to-speech synthesis (TTS) technology has allowed computers to read any written text aloud with a voice that is artificial but almost indistinguishable from real human speech. Such improvement in the quality of synthetic speech has expanded the applications of TTS technology. This chapter will explain the mechanism of a state-of-the-art TTS system after a brief introduction to some conventional speech synthesis methods and their advantages and weaknesses. The TTS system consists of two main components, text analysis and speech signal generation, both of which will be detailed in individual sections. The text analysis section will describe what kinds of linguistic features need to be extracted from text, and then present one of the latest studies at NICT from the forefront of TTS research. In this study, linguistic features are automatically extracted from plain text by applying an advanced deep learning technique. The later sections will detail state-of-the-art speech signal generation using deep neural networks, and then introduce a pioneering study recently conducted at NICT, in which leading-edge deep neural networks that directly generate speech waveforms are combined with subband decomposition signal processing to enable rapid generation of human-sounding, high-quality speech.

Cite

Shiga, Y., Ni, J., Tachibana, K., & Okamoto, T. (2020). Text-to-speech synthesis. In SpringerBriefs in Computer Science (pp. 39–52). Springer. https://doi.org/10.1007/978-981-15-0595-9_3
