Image-text matching is a fundamental research topic bridging vision and language. Recent works use hard negative mining to capture the multiple correspondences between the visual and textual domains. Unfortunately, truly informative negative samples are quite sparse in the training data, making them hard to obtain from a randomly sampled mini-batch alone. Motivated by causal inference, we aim to overcome this shortcoming by carefully analyzing the analogy between hard negative mining and causal effect optimization. We then propose the Counterfactual Matching (CFM) framework for more effective image-text correspondence mining. CFM contains three major components, i.e., Gradient-Guided Feature Selection for automatic causal factor identification, Self-Exploration for causal factor completeness, and Self-Adjustment for counterfactual sample synthesis. Compared with traditional hard negative mining, our method largely alleviates over-fitting and effectively captures the fine-grained correlations between the image and text modalities. We evaluate CFM in combination with three state-of-the-art image-text matching architectures. Quantitative and qualitative experiments on two publicly available datasets demonstrate its strong generality and effectiveness. Code is available at: https://github.com/weihao20/cfm.
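For context on the limitation the abstract describes, below is a minimal PyTorch sketch of the conventional baseline it contrasts with: a triplet loss that mines the hardest negative within each mini-batch (in the style of max-of-hinges losses such as VSE++). This is not the paper's CFM method; the function name, margin value, and embedding shapes are illustrative assumptions. It makes the sparsity problem concrete: the only negatives available per anchor are the other items that happen to be in the sampled batch.

```python
import torch

def hard_negative_triplet_loss(img_emb, txt_emb, margin=0.2):
    """Illustrative in-batch hard negative mining (baseline, not CFM).

    img_emb, txt_emb: (B, D) L2-normalized embeddings; row i of each
    tensor is a matched image-text pair. Candidate negatives are
    limited to the current mini-batch, which is why truly informative
    negatives are rarely available.
    """
    scores = img_emb @ txt_emb.t()           # (B, B) cosine similarities
    pos = scores.diag().view(-1, 1)          # matched-pair scores

    # Hinge costs against every in-batch negative, in both directions.
    cost_txt = (margin + scores - pos).clamp(min=0)      # image anchor, wrong text
    cost_img = (margin + scores - pos.t()).clamp(min=0)  # text anchor, wrong image

    # Zero out the positive pairs on the diagonal.
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_txt = cost_txt.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)

    # Keep only the hardest (highest-cost) negative per anchor.
    return cost_txt.max(dim=1)[0].mean() + cost_img.max(dim=0)[0].mean()
```

If the batch contains no genuinely confusing negative for an anchor, the max over each row or column picks an easy one and the hinge contributes little gradient; CFM's counterfactual sample synthesis is motivated by exactly this failure mode.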
Wei, H., Wang, S., Han, X., Xue, Z., Ma, B., Wei, X., & Wei, X. (2022). Synthesizing Counterfactual Samples for Effective Image-Text Matching. In MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia (pp. 4355–4364). Association for Computing Machinery, Inc. https://doi.org/10.1145/3503161.3547814