Abstract
Existing approaches to audio-and-language task-specific prediction focus on building complicated late-fusion mechanisms. However, these models tend to overfit when labeled data is limited and generalize poorly. In this paper, we present CTAL, a Cross-modal Transformer for Audio-and-Language, which learns intra- and inter-modality connections between audio and language through two proxy tasks over a large number of audio-and-language pairs: masked language modeling and masked cross-modal acoustic modeling. After fine-tuning CTAL on multiple downstream audio-and-language tasks, we observe significant improvements across tasks, including emotion classification, sentiment analysis, and speaker verification. Furthermore, we design a fusion mechanism for the fine-tuning phase that allows CTAL to achieve better performance. Lastly, we conduct detailed ablation studies to demonstrate that both our novel cross-modality fusion component and our audio-and-language pre-training methods contribute to the promising results. The code and pre-trained models are available at https://github.com/tal-ai/CTAL_EMNLP2021.
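The two proxy objectives named in the abstract can be made concrete with a minimal sketch. Everything below is an illustrative assumption rather than the authors' released implementation: the single joint Transformer encoder (the paper builds a more elaborate cross-modal architecture), the module names, the 15% masking ratio, and the loss choices are all simplified stand-ins; see the linked repository for the actual model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the two proxy tasks: masked language modeling (MLM) over text
# tokens and masked cross-modal acoustic modeling over audio frames.
# All sizes and modules here are illustrative assumptions.

VOCAB_SIZE = 30522   # assumed text vocabulary size
AUDIO_DIM = 80       # assumed acoustic feature dimension (e.g., log-mel)
HIDDEN = 256

class ToyCrossModalEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_emb = nn.Embedding(VOCAB_SIZE, HIDDEN)
        self.audio_proj = nn.Linear(AUDIO_DIM, HIDDEN)
        layer = nn.TransformerEncoderLayer(HIDDEN, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.mlm_head = nn.Linear(HIDDEN, VOCAB_SIZE)      # predicts masked tokens
        self.acoustic_head = nn.Linear(HIDDEN, AUDIO_DIM)  # reconstructs masked frames

    def forward(self, token_ids, audio_feats):
        # Concatenating text and audio positions lets self-attention model
        # both intra- and inter-modality connections in one sequence.
        x = torch.cat([self.text_emb(token_ids),
                       self.audio_proj(audio_feats)], dim=1)
        h = self.encoder(x)
        t = token_ids.size(1)
        return self.mlm_head(h[:, :t]), self.acoustic_head(h[:, t:])

def pretrain_step(model, token_ids, audio_feats, mask_token_id=103):
    """One pretraining step: corrupt 15% of each modality, predict the originals."""
    text_mask = torch.rand(token_ids.shape) < 0.15
    audio_mask = torch.rand(audio_feats.shape[:2]) < 0.15
    inp_tokens = token_ids.masked_fill(text_mask, mask_token_id)        # [MASK] tokens
    inp_audio = audio_feats.masked_fill(audio_mask.unsqueeze(-1), 0.0)  # zeroed frames
    logits, recon = model(inp_tokens, inp_audio)
    # MLM: cross-entropy against the original tokens at masked positions.
    mlm_loss = F.cross_entropy(logits[text_mask], token_ids[text_mask])
    # Masked acoustic modeling: L1 reconstruction of the masked frames.
    acoustic_loss = F.l1_loss(recon[audio_mask], audio_feats[audio_mask])
    return mlm_loss + acoustic_loss

# Toy usage with random data.
model = ToyCrossModalEncoder()
tokens = torch.randint(0, VOCAB_SIZE, (2, 16))
audio = torch.randn(2, 50, AUDIO_DIM)
loss = pretrain_step(model, tokens, audio)
loss.backward()
```

Regression over masked frames (rather than classification, as in MLM) is the usual way to handle the continuous acoustic modality; the paper's actual masking scheme and fusion design differ from this simplification.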
Citation
Li, H., Ding, W., Kang, Y., Liu, T., Wu, Z., & Liu, Z. (2021). CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021) (pp. 3966–3977). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.323