CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations

7 citations · 81 Mendeley readers

Abstract

Existing approaches to audio-language task-specific prediction focus on building complicated late-fusion mechanisms. However, these models face challenges of overfitting with limited labels and of poor generalization. In this paper, we present a Cross-modal Transformer for Audio-and-Language, i.e., CTAL, which aims to learn the intra- and inter-modality connections between audio and language through two proxy tasks over a large number of audio-and-language pairs: masked language modeling and masked cross-modal acoustic modeling. After fine-tuning CTAL on multiple downstream audio-and-language tasks, we observe significant improvements across tasks, including emotion classification, sentiment analysis, and speaker verification. Furthermore, we design a fusion mechanism for the fine-tuning phase, which allows CTAL to achieve better performance. Lastly, we conduct detailed ablation studies to demonstrate that both our novel cross-modality fusion component and our audio-language pre-training methods contribute to the promising results. The code and pre-trained models are available at https://github.com/tal-ai/CTAL_EMNLP2021.
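To make the two proxy objectives concrete, the sketch below illustrates the general shape of BERT-style masked pre-training as the abstract describes it: positions are sampled for masking, and a reconstruction loss is computed only over the masked acoustic frames. This is a minimal, illustrative sketch in plain Python; the function names, 15% masking ratio, and L1 reconstruction loss are common defaults in masked acoustic modeling and are assumptions here, not the paper's exact recipe.

```python
import random


def mask_positions(seq_len, mask_prob=0.15, seed=0):
    """Sample positions to mask, BERT-style: each position is selected
    independently with probability mask_prob. (Illustrative helper; the
    15% ratio is an assumed default, not taken from the paper.)"""
    rng = random.Random(seed)
    return [i for i in range(seq_len) if rng.random() < mask_prob]


def masked_reconstruction_loss(frames, reconstructed, masked_idx):
    """Mean L1 distance between original and reconstructed acoustic
    frames, computed only at the masked positions -- the usual objective
    for masked acoustic modeling. Each frame is a list of floats."""
    if not masked_idx:
        return 0.0
    total = 0.0
    for i in masked_idx:
        total += sum(abs(a - b) for a, b in zip(frames[i], reconstructed[i]))
    return total / len(masked_idx)


# Tiny usage example: two 2-dim frames, only frame 1 is masked.
frames = [[1.0, 2.0], [3.0, 4.0]]
recon = [[1.0, 2.0], [2.0, 3.0]]
loss = masked_reconstruction_loss(frames, recon, masked_idx=[1])  # 2.0
```

Masked language modeling follows the same pattern on the text stream, except the objective is a cross-entropy over the vocabulary at the masked token positions rather than a frame-reconstruction loss.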

Citation (APA)
Li, H., Ding, W., Kang, Y., Liu, T., Wu, Z., & Liu, Z. (2021). CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations. In EMNLP 2021 - 2021 Conference on Empirical Methods in Natural Language Processing, Proceedings (pp. 3966–3977). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.emnlp-main.323
