Multimedia content including text, images, and videos is now produced and shared ubiquitously in daily life, which has encouraged researchers to develop algorithms for multimedia search and analysis in a variety of applications. As web data become increasingly multimodal, the task of multimodal classification grows ever more popular and pertinent. In this paper, we focus on videos because of their intrinsic multimodal nature, and resort to attention learning among different modalities for classification. Specifically, we formulate multimodal attention learning as a sequential decision-making process and propose an end-to-end, deep reinforcement learning based framework that determines which modality to select at each time step for the final feature aggregation model. To train our policy networks, we design a supervised reward based on the multi-label classification loss, together with two unsupervised rewards that jointly consider inter-modality correlation for consistency and intra-modality reconstruction for representativeness. Extensive experiments on two large-scale multimodal video datasets evaluate the whole framework and several key components, including the parameters of the policy network, the effects of different rewards, and the rationality of the learned visual-text attention. The promising results demonstrate that our approach outperforms other state-of-the-art attention and multimodal fusion methods on the video classification task.
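To make the described formulation concrete, the sketch below illustrates the general idea in PyTorch: a small policy network samples which modality to attend to at a time step, and a REINFORCE-style update is driven by a combined reward with a supervised multi-label term, an inter-modality consistency term, and an intra-modality reconstruction term. All class names, shapes, and reward weights here are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch (assumption: PyTorch; names and shapes are hypothetical).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalitySelectionPolicy(nn.Module):
    """Policy network: given the current fused state, outputs a distribution
    over modalities (e.g. visual vs. text) to attend to at this time step."""

    def __init__(self, state_dim: int, num_modalities: int = 2, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_modalities),
        )

    def forward(self, state: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(state))


def supervised_reward(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Higher reward when the multi-label classification loss is lower.
    return -F.binary_cross_entropy_with_logits(logits, labels, reduction="none").mean(dim=1)


def consistency_reward(visual_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
    # Unsupervised inter-modality correlation: cosine similarity between modalities.
    return F.cosine_similarity(visual_feat, text_feat, dim=1)


def reconstruction_reward(feat: torch.Tensor, recon: torch.Tensor) -> torch.Tensor:
    # Unsupervised intra-modality representativeness: negative reconstruction error.
    return -F.mse_loss(recon, feat, reduction="none").mean(dim=1)


# One illustrative REINFORCE-style update on a toy batch.
batch, state_dim, feat_dim, num_labels = 4, 256, 128, 25
policy = ModalitySelectionPolicy(state_dim)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

state = torch.randn(batch, state_dim)
dist = policy(state)
action = dist.sample()  # chosen modality index per sample

# Placeholder tensors standing in for outputs of the feature aggregation model.
class_logits = torch.randn(batch, num_labels)
labels = torch.randint(0, 2, (batch, num_labels)).float()
visual_feat = torch.randn(batch, feat_dim)
text_feat = torch.randn(batch, feat_dim)
recon = visual_feat + 0.1 * torch.randn(batch, feat_dim)

reward = (supervised_reward(class_logits, labels)
          + consistency_reward(visual_feat, text_feat)
          + reconstruction_reward(visual_feat, recon))

# Policy gradient: maximize expected reward of the sampled modality choices.
loss = -(dist.log_prob(action) * reward.detach()).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In practice the state, classification logits, and modality features would come from the video's visual and text encoders and the aggregation model rather than random tensors; the sketch only shows how the three reward signals could be combined to drive the modality-selection policy.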
Liu, M., & Liu, Z. (2019). Deep reinforcement learning visual-text attention for multimodal video classification. In MULEA 2019 - 1st International Workshop on Multimodal Understanding and Learning for Embodied Applications, co-located with MM 2019 (pp. 13–21). Association for Computing Machinery, Inc. https://doi.org/10.1145/3347450.3357654