Predicting F0 and voicing from NAM-captured whispered speech

Abstract

NAM-to-speech conversion, proposed by Toda and colleagues, converts Non-Audible Murmur (NAM) to audible speech using a statistical mapping trained on aligned corpora. It is a very promising technique, but its performance is still insufficient, mainly because it is difficult to estimate the F0 of the converted voice from unvoiced speech. In this paper, we propose a method to improve F0 estimation and the voicing decision in a NAM-to-speech conversion system based on Gaussian Mixture Models (GMMs) applied to whispered speech. Instead of combining the voicing decision and F0 estimation in a single GMM, a simple feed-forward neural network detects voiced segments in the whisper, while a GMM estimates a continuous melodic contour from the voiced segments of the training data. The error rate of the network's voiced/unvoiced decision is 6.8%, compared with 9.2% for the original system. The proposal also reduces the F0 estimation error.
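
The decoupling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the input features, the network architecture, or the GMM configuration, so every function name, parameter, and the 0.5 threshold below are assumptions made for illustration. The sketch shows the standard conditional-mean mapping for a joint GMM over input features and F0, a step that keeps the continuous F0 contour only on frames a separate classifier marks as voiced, and a frame-level voiced/unvoiced error rate.

```python
import numpy as np


def gaussian_pdf(x, mean, cov):
    """Density of a multivariate Gaussian at x."""
    diff = x - mean
    maha = diff @ np.linalg.solve(cov, diff)
    log_det = np.linalg.slogdet(cov)[1]
    dim = mean.shape[0]
    return np.exp(-0.5 * (maha + dim * np.log(2.0 * np.pi) + log_det))


def gmm_conditional_mean(x, weights, means, covs, dx):
    """E[y | x] for a joint GMM over z = [x, y].

    The first `dx` dimensions of each mean/covariance are the input
    features; the remaining dimensions are the target (here, F0).
    """
    # Posterior probability of each mixture component given x.
    resp = np.array([
        w * gaussian_pdf(x, m[:dx], c[:dx, :dx])
        for w, m, c in zip(weights, means, covs)
    ])
    resp = resp / resp.sum()

    # Responsibility-weighted sum of per-component linear regressions.
    y = np.zeros(means[0].shape[0] - dx)
    for r, m, c in zip(resp, means, covs):
        y += r * (m[dx:] + c[dx:, :dx] @ np.linalg.solve(c[:dx, :dx], x - m[:dx]))
    return y


def combine_voicing_and_f0(voiced_prob, f0_contour, threshold=0.5):
    """Keep the continuous F0 value on voiced frames, 0 elsewhere.

    `voiced_prob` stands in for the per-frame output of a voicing
    classifier; the threshold is an assumed value.
    """
    voiced = np.asarray(voiced_prob) >= threshold
    return np.where(voiced, np.asarray(f0_contour, dtype=float), 0.0)


def vuv_error_rate(pred_voiced, ref_voiced):
    """Fraction of frames whose voiced/unvoiced label is wrong."""
    pred = np.asarray(pred_voiced, dtype=bool)
    ref = np.asarray(ref_voiced, dtype=bool)
    return float(np.mean(pred != ref))
```

Because the F0 regressor in this sketch is fitted only on voiced frames, it never has to model the discontinuity at voicing boundaries; the voicing decision is applied afterwards as a mask.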

Cite

Tran, V. A., Bailly, G., Loevenbruck, H., & Toda, T. (2008). Predicting F0 and voicing from NAM-captured whispered speech. In Proceedings of the 4th International Conference on Speech Prosody, SP 2008 (pp. 107–110). International Speech Communication Association. https://doi.org/10.21437/speechprosody.2008-25
