This paper presents a novel intra-gender statistical singing voice conversion (SVC) technique with direct waveform modification based on the log-spectrum differential (DIFFSVC), which converts the voice timbre of a source singer into that of a target singer without vocoder-based waveform generation of the converted singing voice. SVC makes it possible to convert the singing voice characteristics of an arbitrary source singer into those of an arbitrary target singer by converting acoustic features, such as F0, aperiodicity, and spectral features, with a statistical conversion function. However, the sound quality of the converted singing voice is typically degraded compared with that of a natural singing voice, owing to factors such as analysis and modeling errors in the vocoding process and over-smoothing of the converted feature trajectory. To alleviate this degradation, we propose a statistical conversion process that directly modifies the signal in the waveform domain by estimating the difference between the spectra of the source and target singers’ singing voices. We further propose two techniques for the DIFFSVC method: 1) derivation of a differential Gaussian mixture model (DIFFGMM) from a conventional Gaussian mixture model (GMM) and 2) a parameter generation algorithm that considers the global variance (GV). The experimental results demonstrate that the proposed DIFFSVC methods significantly improve the sound quality of the converted singing voice compared with conventional SVC, while preserving the conversion accuracy of the singer's identity.
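To make the DIFFGMM derivation concrete, the following is a minimal sketch of one way to obtain it, assuming the conventional GMM models the joint vector [X; Y] of source and target spectral features and that the differential is defined as D = Y - X. Under that assumption, [X; D] is a linear transform of [X; Y], so each mixture component's mean and covariance can be transformed in closed form while the mixture weights stay unchanged. The function name `derive_diffgmm` and the array layout are hypothetical and chosen only for illustration; the paper's exact formulation may differ.

```python
import numpy as np


def derive_diffgmm(means, covs):
    """Transform joint-GMM parameters over [X; Y] into parameters over [X; D].

    Assumption (not taken from the paper's text): D = Y - X.
    means: array of shape (M, 2 * dim), one row per mixture component.
    covs:  array of shape (M, 2 * dim, 2 * dim), full covariance matrices.
    Mixture weights are not affected by the transform and are not handled here.
    """
    means = np.asarray(means, dtype=float)
    covs = np.asarray(covs, dtype=float)
    dim = means.shape[1] // 2

    # Linear map A such that [X; D] = A @ [X; Y].
    eye = np.eye(dim)
    zero = np.zeros((dim, dim))
    A = np.block([[eye, zero], [-eye, eye]])

    # For a Gaussian, z' = A z gives mean A @ mu and covariance A @ Sigma @ A.T.
    diff_means = means @ A.T
    diff_covs = A @ covs @ A.T  # matmul broadcasts over the mixture axis
    return diff_means, diff_covs


# Toy example: one mixture component, one-dimensional features.
joint_means = np.array([[1.0, 3.0]])                 # [mu_X, mu_Y]
joint_covs = np.array([[[2.0, 1.0], [1.0, 3.0]]])    # [[S_XX, S_XY], [S_YX, S_YY]]
diff_means, diff_covs = derive_diffgmm(joint_means, joint_covs)
```

In this toy case the differential mean is mu_Y - mu_X, the cross-covariance block is S_XY - S_XX, and the differential variance is S_XX + S_YY - S_XY - S_YX, which matches what the linear transform produces block by block.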
Kobayashi, K., Toda, T., & Nakamura, S. (2018). Intra-gender statistical singing voice conversion with direct waveform modification using log-spectral differential. Speech Communication, 99, 211–220. https://doi.org/10.1016/j.specom.2018.03.011