Joint adversarial training of speech recognition and synthesis models for many-to-one voice conversion using phonetic posteriorgrams

Abstract

This paper presents a method for many-to-one voice conversion (VC) using phonetic posteriorgrams (PPGs), based on adversarial training of deep neural networks (DNNs). A conventional method for many-to-one VC learns a mapping function from input acoustic features to target acoustic features through separately trained DNN-based speech recognition and synthesis models. However, two factors degrade the converted speech quality: 1) speaker-dependent differences observed in the PPGs, and 2) an over-smoothing effect on the generated acoustic features. The proposed method applies domain-adversarial training to the recognition model to reduce the PPG differences, and incorporates a generative adversarial network (GAN) into the training of the synthesis model to alleviate the over-smoothing effect. Unlike the conventional method, the proposed method trains the recognition and synthesis models jointly, so that both are optimized for many-to-one VC. Experimental evaluation demonstrates that the proposed method significantly improves the converted speech quality compared with conventional VC methods.
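To make the joint objective concrete, the following is a minimal numerical sketch of how the losses described above might be combined: a recognition (PPG) loss, a domain-adversarial speaker-classification loss whose sign is flipped for the recognition model (the effect of a gradient reversal layer), a synthesis reconstruction loss, and a GAN generator loss that counteracts over-smoothing. All function names, loss weights, and feature shapes are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def recognition_loss(ppg_pred, ppg_target):
    # Cross-entropy between predicted and target phonetic posteriorgrams.
    eps = 1e-12
    return -np.mean(np.sum(ppg_target * np.log(ppg_pred + eps), axis=-1))

def domain_adversarial_loss(domain_logits, speaker_ids):
    # Speaker-classification loss computed on PPG features; the recognition
    # model is trained to *maximize* it (gradient reversal) so that the
    # PPGs become speaker-independent.
    exp = np.exp(domain_logits - domain_logits.max(axis=-1, keepdims=True))
    probs = exp / exp.sum(axis=-1, keepdims=True)
    picked = probs[np.arange(len(speaker_ids)), speaker_ids]
    return -np.mean(np.log(picked + 1e-12))

def gan_generator_loss(disc_out_on_fake):
    # Non-saturating GAN generator loss: pushes generated acoustic features
    # toward the real-feature distribution, alleviating over-smoothing.
    return -np.mean(np.log(disc_out_on_fake + 1e-12))

def joint_loss(ppg_pred, ppg_target, domain_logits, speaker_ids,
               feat_pred, feat_target, disc_out_on_fake,
               w_adv=0.1, w_gan=0.05):
    # Joint objective for recognition + synthesis models. The domain loss
    # enters with a negative sign from the recognition model's viewpoint
    # (gradient reversal); w_adv and w_gan are hypothetical weights.
    l_rec = recognition_loss(ppg_pred, ppg_target)
    l_syn = np.mean((feat_pred - feat_target) ** 2)  # e.g. mel-cepstral MSE
    l_dom = domain_adversarial_loss(domain_logits, speaker_ids)
    l_gan = gan_generator_loss(disc_out_on_fake)
    return l_rec + l_syn - w_adv * l_dom + w_gan * l_gan
```

In an actual DNN framework the sign flip would be realized by a gradient reversal layer rather than an explicit negative term, and the discriminator would be trained alternately with its own loss; this sketch only shows how the terms compose.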

Citation (APA)
Saito, Y., Akuzawa, K., & Tachibana, K. (2020). Joint adversarial training of speech recognition and synthesis models for many-to-one voice conversion using phonetic posteriorgrams. IEICE Transactions on Information and Systems, E103D(9), 1978–1987. https://doi.org/10.1587/transinf.2019EDP7297
