Improving Closed and Open-Vocabulary Attribute Prediction Using Transformers


Abstract

We study recognizing attributes of objects in visual scenes. We consider attributes to be any phrases that describe an object’s physical and semantic properties and its relationships with other objects. Existing work studies attribute prediction in a closed setting with a fixed set of attributes, using models that exploit only limited context. We propose TAP, a new Transformer-based model that can utilize context and predict attributes for multiple objects in a scene in a single forward pass, together with a training scheme that allows the model to learn attribute prediction from image-text datasets. Experiments on VAW, a large closed-vocabulary attribute benchmark, show that TAP outperforms the state of the art by 5.1% mAP. In addition, by utilizing pretrained text embeddings, we extend our model to OpenTAP, which can recognize novel attributes not seen during training. In a large-scale setting, we further show that OpenTAP can predict a large number of seen and unseen attributes, outperforming the large-scale vision-language model CLIP by a decisive margin. The project page is available at https://vkhoi.github.io/TAP.
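To make the open-vocabulary mechanism concrete, here is a minimal sketch (not the authors' code) of how attribute scoring with pretrained text embeddings can work in the spirit of OpenTAP: contextualized object features are compared against text embeddings of attribute phrases, so attributes unseen during training can still be scored at test time. The function name, dimensions, and temperature value below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def score_attributes(object_feats: torch.Tensor,
                     attr_text_embeds: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Return per-object, per-attribute scores in [0, 1].

    object_feats:     (num_objects, d) contextualized object embeddings,
                      e.g. Transformer outputs for all objects in a scene.
    attr_text_embeds: (num_attrs, d) pretrained text embeddings of attribute
                      phrases (e.g. "wooden", "next to a tree"), which may
                      include attributes never seen during training.
    """
    obj = F.normalize(object_feats, dim=-1)
    txt = F.normalize(attr_text_embeds, dim=-1)
    # Cosine similarity between every object and every attribute phrase.
    logits = obj @ txt.t() / temperature
    # Attribute prediction is multi-label, so apply a per-attribute sigmoid
    # rather than a softmax over the attribute vocabulary.
    return torch.sigmoid(logits)

# Toy usage: 3 objects in a scene, 5 candidate attribute phrases.
scores = score_attributes(torch.randn(3, 256), torch.randn(5, 256))
print(scores.shape)  # torch.Size([3, 5])
```

Because the attribute vocabulary enters only through the text-embedding matrix, extending the candidate set at inference time is just a matter of embedding new phrases, with no retraining of the scoring head.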

Citation (APA)

Pham, K., Kafle, K., Lin, Z., Ding, Z., Cohen, S., Tran, Q., & Shrivastava, A. (2022). Improving Closed and Open-Vocabulary Attribute Prediction Using Transformers. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 13685 LNCS, pp. 201–219). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-19806-9_12
