WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

Xinhao Mei; Chutong Meng; Haohe Liu; Qiuqiang Kong; Tom Ko; Chengqi Zhao; Mark D. Plumbley; Yuexian Zou; Wenwu Wang

Journal Article

IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3339-3354 (2024)

DOI: 10.1109/TASLP.2024.3419446

Abstract

Audio-language (AL) multimodal learning has advanced significantly in recent years, yet existing audio-language datasets remain small because collecting them is costly and time-consuming. To address this data scarcity, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions. We sourced audio clips and their raw descriptions from web sources and a sound event detection dataset. However, the raw descriptions harvested online are highly noisy and unsuitable for direct use in tasks such as automated audio captioning. To overcome this issue, we propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, in which ChatGPT, a large language model, is used to filter and transform raw descriptions automatically. We conduct a comprehensive analysis of the characteristics of the WavCaps dataset and evaluate it on multiple downstream audio-language multimodal learning tasks. The systems trained on WavCaps outperform previous state-of-the-art (SOTA) models by a significant margin. We hope that WavCaps will facilitate research in audio-language multimodal learning and demonstrate the potential of using large language models (LLMs) to enhance academic research.

Cite (APA)

Mei, X., Meng, C., Liu, H., Kong, Q., Ko, T., Zhao, C., … Wang, W. (2024). WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research. IEEE/ACM Transactions on Audio Speech and Language Processing, 32, 3339–3354. https://doi.org/10.1109/TASLP.2024.3419446
