TAVT: Towards Transferable Audio-Visual Text Generation

Abstract

Audio-visual text generation aims to understand multi-modal content and translate it into text. Although various transfer learning techniques for text generation have been proposed, they focus on uni-modal analysis (e.g., text-to-text, visual-to-text) and lack consideration of multi-modal content and cross-modal relations. Motivated by the fact that humans can recognize the timbre of the same low-level concepts (e.g., footsteps, rainfall, and laughter), even in different visual conditions, we aim to mitigate domain discrepancies through audio-visual correlation. In this paper, we propose a novel Transferable Audio-Visual Text Generation framework, named TAVT, which consists of two key components: the Audio-Visual Meta-Mapper (AVMM) and Dual Counterfactual Contrastive Learning (DCCL). (1) AVMM first introduces a universal auditory semantic space and drifts the domain-invariant low-level concepts into visual prefixes. Reconstruction-based learning then encourages the AVMM to learn “which pixels belong to the same sound” and to produce an audio-enhanced visual prefix. The well-trained AVMM can further be applied to uni-modal settings. (2) Furthermore, DCCL leverages destructive counterfactual transformations to provide cross-modal constraints for AVMM from the perspectives of feature distribution and text generation. (3) The experimental results show that TAVT outperforms state-of-the-art methods across multiple domains (cross-dataset, cross-category) and various modal settings (uni-modal, multi-modal).

Cite

Lin, W., Jin, T., Wang, Y., Pan, W., Li, L., Cheng, X., & Zhao, Z. (2023). TAVT: Towards Transferable Audio-Visual Text Generation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 14983–14999). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.acl-long.836
