Abstract
Recent years have witnessed the rapid growth of Virtual YouTubers (VTubers) as a new business on social media platforms such as YouTube, Twitch, and TikTok. However, a significant challenge arises when VTuber voice actors face health issues or retire, jeopardizing the continuity of their avatars' recognizable voices. Our work is inspired by a potential solution reminiscent of Conan's bow-tie voice changer in the popular animation Case Closed (i.e., Detective Conan). To make this a reality, we introduce VTuberBowTie, a user-friendly streaming voice conversion system for real-time VTuber livestreaming. We propose a streaming voice conversion approach that tackles two challenges inherent to conventional real-time voice conversion: limited context modeling and bidirectional context dependence. Rather than processing each chunk of the voice stream in isolation, our approach adopts a fully sequential structure that leverages the contextual information preceding the input chunk, thereby expanding the perceptual range and enabling seamless concatenation of the converted chunks. Moreover, we developed a ready-to-use interaction interface for VTuberBowTie and deployed it on various computing platforms. The experimental results show that VTuberBowTie achieves high-quality voice conversion in a streaming manner, with a latency of 179.1 ms on CPU and 70.8 ms on GPU, while providing users with a friendly interactive experience.
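The idea of carrying preceding context into each chunk can be illustrated with a minimal sketch. This is not the VTuberBowTie model: the abstract does not describe the architecture, so the example below assumes a simple buffer of past samples, and it uses a causal moving average as a stand-in for the conversion network. All names (`ContextStreamer`, `causal_moving_average`, `stream`) are hypothetical and chosen for illustration only.

```python
from collections import deque
from typing import Callable, List

Samples = List[float]


def causal_moving_average(x: Samples, taps: int = 4) -> Samples:
    """Stand-in for a conversion model (assumption, not the paper's network).

    Each output depends only on the current sample and the previous
    taps - 1 samples, so no lookahead (future context) is needed.
    Missing history at the start is treated as zeros.
    """
    out = []
    for i in range(len(x)):
        window = x[max(0, i - taps + 1): i + 1]
        out.append(sum(window) / taps)
    return out


class ContextStreamer:
    """Processes a stream chunk by chunk while remembering past samples."""

    def __init__(self, convert: Callable[[Samples], Samples], context_len: int):
        self.convert = convert
        # Keeps only the most recent context_len input samples.
        self.context = deque(maxlen=context_len)

    def process(self, chunk: Samples) -> Samples:
        # Prepend the stored context so the model sees what came before.
        buffered = list(self.context) + list(chunk)
        converted = self.convert(buffered)
        self.context.extend(chunk)
        # Emit only the outputs that correspond to the new chunk.
        return converted[len(buffered) - len(chunk):]


def stream(signal: Samples, chunk_size: int, context_len: int) -> Samples:
    """Runs the whole signal through a ContextStreamer and joins the chunks."""
    streamer = ContextStreamer(causal_moving_average, context_len)
    out: Samples = []
    for start in range(0, len(signal), chunk_size):
        out.extend(streamer.process(signal[start:start + chunk_size]))
    return out


signal = [float(i % 7) for i in range(40)]
offline = causal_moving_average(signal)          # whole signal at once
with_context = stream(signal, chunk_size=8, context_len=3)
no_context = stream(signal, chunk_size=8, context_len=0)

print(with_context == offline)  # chunked output matches the offline result
print(no_context == offline)    # isolated chunks differ at the boundaries
```

In this toy setting, a context of `taps - 1` past samples is enough for the chunked output to match offline processing exactly, whereas chunks processed in isolation produce discontinuities at every chunk boundary. A real conversion model would need a much longer context, and likely cached internal state rather than raw samples, but the bookkeeping follows the same pattern.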
Chen, Q., Gu, Z., Lu, L., Xu, X., Ba, Z., Lin, F., … Ren, K. (2024). Conan’s Bow Tie: A Streaming Voice Conversion for Real-Time VTuber Livestreaming. In ACM International Conference Proceeding Series (pp. 35–50). Association for Computing Machinery. https://doi.org/10.1145/3640543.3645146