Voice conversion using speaker-dependent conditional restricted Boltzmann machine

Abstract

This paper presents a voice conversion (VC) method that uses a conditional restricted Boltzmann machine (CRBM) for each speaker to obtain high-order speaker-independent spaces in which voice features are converted more easily than in the original acoustic feature space. The CRBM is expected to automatically discover common features latent in time-series data. When two CRBMs are trained independently for a source and a target speaker, each using only that speaker's own training data, each CRBM can be regarded as constructing a subspace that carries less phonemic information and relatively more speaker individuality than the original acoustic space, because the training data cover various phonemes while the speaker individuality stays unchanged. The high-order features obtained for the two speakers are then linked (concatenated) by a neural network (NN) that maps the source features to the target features. The entire network (the two CRBMs and the NN) can also be fine-tuned as a recurrent neural network (RNN) using acoustic parallel data, since both the CRBMs and the concatenating NN have network-based representations with time dependencies. In voice-conversion experiments, we confirmed that our method performs well, particularly in objective evaluation, in comparison with conventional GMM, NN, and RNN approaches and with our previous speaker-dependent DBN approach.
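
The data flow described in the abstract (source CRBM, concatenating NN, target CRBM) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The class `CRBM`, the function `convert`, the layer sizes, the single frame of history, the single-layer NN, and the random untrained weights are all assumptions made for illustration. The sketch shows only the forward conversion pass, with no pre-training or fine-tuning.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class CRBM:
    """Toy conditional RBM (hypothetical): real-valued visible units,
    sigmoid hidden units, conditioned on one past visible frame."""

    def __init__(self, n_visible, n_hidden, rng):
        # Random placeholder weights; a real model would be trained.
        self.W = 0.1 * rng.standard_normal((n_visible, n_hidden))   # visible <-> hidden
        self.A = 0.1 * rng.standard_normal((n_visible, n_visible))  # past visible -> visible
        self.B = 0.1 * rng.standard_normal((n_visible, n_hidden))   # past visible -> hidden
        self.a = np.zeros(n_visible)  # visible bias
        self.b = np.zeros(n_hidden)   # hidden bias

    def encode(self, v, v_prev):
        # Hidden activation probabilities given current and past frames.
        return sigmoid(v @ self.W + v_prev @ self.B + self.b)

    def decode(self, h, v_prev):
        # Mean of the visible units given hidden units and the past frame.
        return h @ self.W.T + v_prev @ self.A + self.a


def convert(source_frames, src, tgt, M, c):
    """Source frames -> source hidden -> NN -> target hidden -> target frames."""
    n_frames, n_visible = source_frames.shape
    out = np.zeros((n_frames, n_visible))
    x_prev = np.zeros(n_visible)  # past source frame (zeros at t = 0)
    y_prev = np.zeros(n_visible)  # past converted frame (zeros at t = 0)
    for t in range(n_frames):
        h_src = src.encode(source_frames[t], x_prev)
        h_tgt = sigmoid(h_src @ M + c)       # assumed single-layer NN
        out[t] = tgt.decode(h_tgt, y_prev)   # depends on the previous output frame
        x_prev, y_prev = source_frames[t], out[t]
    return out


rng = np.random.default_rng(0)
n_visible, n_hidden = 24, 48  # assumed feature and hidden sizes
src = CRBM(n_visible, n_hidden, rng)
tgt = CRBM(n_visible, n_hidden, rng)
M = 0.1 * rng.standard_normal((n_hidden, n_hidden))
c = np.zeros(n_hidden)
frames = rng.standard_normal((5, n_visible))  # stand-in for source acoustic features
converted = convert(frames, src, tgt, M, c)
```

Because each output frame feeds back into the next decoding step, the unrolled loop has the recurrent structure that the abstract says allows the whole network to be fine-tuned as an RNN.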

Cite

Nakashika, T., Takiguchi, T., & Ariki, Y. (2015). Voice conversion using speaker-dependent conditional restricted Boltzmann machine. Eurasip Journal on Audio, Speech, and Music Processing, 2015(1). https://doi.org/10.1186/s13636-014-0044-3
