Text-to-Audio Generation using Instruction Guided Latent Diffusion Model

Abstract

The immense scale of recent large language models (LLMs) enables many interesting properties, such as instruction- and chain-of-thought-based fine-tuning, which have significantly improved zero- and few-shot performance on many natural language processing (NLP) tasks. Inspired by these successes, we adopt the instruction-tuned LLM Flan-T5 as the text encoder for text-to-audio (TTA) generation, a task whose goal is to generate audio from a textual description. Prior works on TTA either pre-trained a joint text-audio encoder or used a non-instruction-tuned model such as T5. Consequently, our latent diffusion model (LDM)-based approach, Tango, outperforms the state-of-the-art AudioLDM on most metrics and stays comparable on the rest on the AudioCaps test set, despite training the LDM on a 63-times-smaller dataset and keeping the text encoder frozen. This improvement may also be attributable to the adoption of audio pressure-level-based sound mixing for training-set augmentation, whereas prior methods use random mixing.
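The two technical choices above can be illustrated concretely. First, the frozen instruction-tuned text encoder: below is a minimal sketch (not the authors' released code) of how Flan-T5 embeddings can be extracted with Hugging Face transformers to serve as the conditioning context for an LDM; the checkpoint name and the use of per-token states are illustrative assumptions.

```python
# Hedged sketch: a frozen Flan-T5 encoder producing text conditioning for TTA.
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
encoder = T5EncoderModel.from_pretrained("google/flan-t5-large")
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False  # the abstract states the text encoder stays frozen

@torch.no_grad()
def encode_prompt(prompt: str) -> torch.Tensor:
    """Per-token embeddings usable as cross-attention context for the LDM."""
    batch = tokenizer(prompt, return_tensors="pt", truncation=True)
    return encoder(**batch).last_hidden_state  # shape: (1, seq_len, hidden)

context = encode_prompt("a dog barks while birds chirp in the distance")
```

Second, pressure-level-based mixing: the abstract contrasts it with random mixing but gives no formula. A plausible reading, following the between-class learning recipe for sounds (Tokozume et al.), is to draw a mixing ratio and convert it into amplitude weights using the two clips' estimated sound pressure levels; the RMS-based level estimate and the ratio bounds below are assumptions of this sketch.

```python
import numpy as np

def pressure_level_db(x: np.ndarray) -> float:
    """Estimate a clip's sound pressure level from its RMS amplitude (dB)."""
    rms = np.sqrt(np.mean(x ** 2) + 1e-12)
    return 20.0 * np.log10(rms)

def mix_by_pressure(x1: np.ndarray, x2: np.ndarray, rng=np.random) -> np.ndarray:
    """Mix two equal-length clips so a random ratio r holds at the
    pressure level rather than in raw amplitude."""
    r = rng.uniform(0.05, 0.95)  # keep r away from 0/1 to avoid degenerate weights
    g1, g2 = pressure_level_db(x1), pressure_level_db(x2)
    # amplitude weight that realizes ratio r after accounting for the
    # level difference (g1 - g2) between the two clips
    p = 1.0 / (1.0 + 10.0 ** ((g1 - g2) / 20.0) * (1.0 - r) / r)
    mixed = p * x1 + (1.0 - p) * x2
    return mixed / np.sqrt(p ** 2 + (1.0 - p) ** 2)  # roughly energy-preserving
```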

Citation (APA)

Ghosal, D., Majumder, N., Mehrish, A., & Poria, S. (2023). Text-to-Audio Generation using Instruction Guided Latent Diffusion Model. In MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia (pp. 3590–3598). Association for Computing Machinery, Inc. https://doi.org/10.1145/3581783.3612348
