Audio Generation with Multiple Conditional Diffusion Model

Abstract

Text-based audio generation models have limitations as they cannot encompass all the information in audio, leading to restricted controllability when relying solely on text. To address this issue, we propose a novel model that enhances the controllability of existing pre-trained text-to-audio models by incorporating additional conditions including content (timestamp) and style (pitch contour and energy contour) as supplements to the text. This approach achieves fine-grained control over the temporal order, pitch, and energy of generated audio. To preserve the diversity of generation, we employ a trainable control condition encoder that is enhanced by a large language model and a trainable Fusion-Net to encode and fuse the additional conditions while keeping the weights of the pre-trained text-to-audio model frozen. Due to the lack of suitable datasets and evaluation metrics, we consolidate existing datasets into a new dataset comprising the audio and corresponding conditions and use a series of evaluation metrics to evaluate the controllability performance. Experimental results demonstrate that our model successfully achieves fine-grained control to accomplish controllable audio generation.
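As a rough illustration of the kind of frame-level control signals the abstract mentions, the sketch below derives an energy contour from a waveform and rasterizes timestamp events into a per-frame activity matrix. This is not the authors' preprocessing: the function names, the RMS definition of energy, the frame parameters, and the (label, onset, offset) event format are all assumptions made for the example.

```python
import math


def energy_contour(samples, frame_len, hop):
    """Frame-wise RMS energy of a mono waveform given as a list of floats.

    Assumption: energy is taken as RMS per frame; the paper may define it differently.
    """
    n = len(samples)
    if n < frame_len:
        return []
    contour = []
    for start in range(0, n - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        contour.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return contour


def timestamp_to_frames(events, classes, n_frames, frame_dur):
    """Turn (label, onset_s, offset_s) events into a class-by-frame 0/1 matrix.

    Hypothetical event format, used only to illustrate a timestamp condition.
    """
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * n_frames for _ in classes]
    for label, onset, offset in events:
        first = max(0, math.floor(onset / frame_dur))
        last = min(n_frames, math.ceil(offset / frame_dur))
        for t in range(first, last):
            matrix[index[label]][t] = 1
    return matrix


if __name__ == "__main__":
    wave = [1.0, -1.0, 1.0, -1.0, 0.0, 0.0, 0.0, 0.0]
    print(energy_contour(wave, frame_len=4, hop=4))
    print(timestamp_to_frames([("dog", 0.5, 1.0)], ["dog", "rain"], 8, 0.25))
```

In a setup like the one described, contours and matrices of this kind would be passed to the trainable condition encoder, while the pre-trained text-to-audio model stays frozen.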

Cite

Guo, Z., Mao, J., Tao, R., Yan, L., Ouchi, K., Liu, H., & Wang, X. (2024). Audio Generation with Multiple Conditional Diffusion Model. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 38, pp. 18153–18161). Association for the Advancement of Artificial Intelligence. https://doi.org/10.1609/aaai.v38i16.29773
