Abstract
Zero-shot learning (ZSL) aims to predict unseen classes whose samples never appear during training. As annotations of class-level visual characteristics, attributes are widely used semantic information for zero-shot image classification. However, current methods often fail to discriminate subtle visual distinctions between images, owing not only to the lack of fine-grained annotations but also to attribute imbalance and co-occurrence. In this paper, we present a transformer-based end-to-end ZSL method named DUET, which integrates latent semantic knowledge from pre-trained language models (PLMs) via a self-supervised multi-modal learning paradigm. Specifically, we (1) develop a cross-modal semantic grounding network to investigate the model's capability of disentangling semantic attributes from images; (2) apply an attribute-level contrastive learning strategy to further enhance the model's discrimination of fine-grained visual characteristics against attribute co-occurrence and imbalance; and (3) propose a multi-task learning policy that jointly considers the multi-modal objectives. We find that DUET achieves state-of-the-art performance on three standard ZSL benchmarks and a knowledge-graph-equipped ZSL benchmark, that its components are effective, and that its predictions are interpretable.
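To make the attribute-level contrastive objective and the multi-task weighting more concrete, the sketch below shows a generic InfoNCE-style contrastive term over paired image-side and text-side attribute embeddings, combined with a classification (grounding) term. This is a minimal illustration under assumed shapes and hyperparameters (the temperature, the weight w_con, and the toy tensors are all placeholders), not the authors' DUET implementation.

```python
# Minimal sketch only: generic attribute-level contrastive (InfoNCE) loss plus a
# classification term under a simple multi-task weighting. All names and values
# here are illustrative assumptions, not DUET's actual code.
import torch
import torch.nn.functional as F

def attribute_contrastive_loss(img_attr_emb, txt_attr_emb, temperature=0.07):
    """Pull each image's attribute embedding toward its matching attribute text
    embedding and push it away from the other attributes in the batch."""
    img = F.normalize(img_attr_emb, dim=-1)               # (N, d)
    txt = F.normalize(txt_attr_emb, dim=-1)               # (N, d)
    logits = img @ txt.t() / temperature                  # (N, N) similarities
    targets = torch.arange(img.size(0), device=img.device)
    return F.cross_entropy(logits, targets)

def multitask_loss(cls_logits, labels, img_attr_emb, txt_attr_emb, w_con=0.5):
    """Weighted sum of a classification (grounding) term and the contrastive term."""
    grounding = F.cross_entropy(cls_logits, labels)
    contrastive = attribute_contrastive_loss(img_attr_emb, txt_attr_emb)
    return grounding + w_con * contrastive

# Toy usage with random tensors.
N, d, num_classes = 8, 64, 10
loss = multitask_loss(
    torch.randn(N, num_classes), torch.randint(0, num_classes, (N,)),
    torch.randn(N, d), torch.randn(N, d),
)
print(loss.item())
```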
Citation
Chen, Z., Huang, Y., Chen, J., Geng, Y., Zhang, W., Fang, Y., … Chen, H. (2023). DUET: Cross-Modal Semantic Grounding for Contrastive Zero-Shot Learning. In Proceedings of the 37th AAAI Conference on Artificial Intelligence, AAAI 2023 (Vol. 37, pp. 405–413). AAAI Press. https://doi.org/10.1609/aaai.v37i1.25114