X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks

Abstract

In this paper, we study challenging instance-wise vision-language tasks, where free-form language must be aligned with individual objects rather than the whole image. To address these tasks, we propose X-DETR, an architecture with three major components: an object detector, a language encoder, and vision-language alignment. The vision and language streams remain independent until the end, where they are aligned with an efficient dot-product operation. The whole network is trained end-to-end, so that the detector is optimized for the vision-language tasks rather than used as an off-the-shelf component. To overcome the limited size of paired object-language annotations, we leverage other, weaker forms of supervision to expand the knowledge coverage. This simple yet effective X-DETR architecture achieves good accuracy at fast speed on multiple instance-wise vision-language tasks, e.g., 16.4 AP on LVIS detection over 1.2K categories at ∼20 frames per second, without using any LVIS annotations during training. The code is available at https://github.com/amazon-research/cross-modal-detr.
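
To make the alignment step concrete, below is a minimal sketch of scoring detected objects against free-form language queries with a single dot product, as the abstract describes. The function name, tensor shapes, and the L2 normalization are illustrative assumptions, not details taken from the paper; the abstract only specifies that the two independent streams meet in a dot-product operation.

```python
# Illustrative sketch of dot-product vision-language alignment (assumed
# details: shared embedding dimension, cosine-style normalization).
import torch
import torch.nn.functional as F

def align(object_embeddings: torch.Tensor,
          text_embeddings: torch.Tensor) -> torch.Tensor:
    """Score each detected object against each language query.

    object_embeddings: (num_objects, d) per-object features from the detector
    text_embeddings:   (num_queries, d) pooled features from the language encoder
    Returns an alignment matrix of shape (num_objects, num_queries).
    """
    # L2-normalize both streams so the dot product acts as a cosine
    # similarity (an assumption; the paper may scale differently).
    obj = F.normalize(object_embeddings, dim=-1)
    txt = F.normalize(text_embeddings, dim=-1)
    # The two streams stay independent until this single dot-product step.
    return obj @ txt.t()

# Example: 100 object queries, 3 free-form phrases, a 256-d joint space.
scores = align(torch.randn(100, 256), torch.randn(3, 256))
print(scores.shape)  # torch.Size([100, 3])
```

Because the interaction is a single matrix product rather than a cross-attention stack, scoring many objects against many phrases stays cheap, which is consistent with the reported ∼20 frames per second.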

Citation

Cai, Z., Kwon, G., Ravichandran, A., Bas, E., Tu, Z., Bhotika, R., & Soatto, S. (2022). X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 13696 LNCS, pp. 290–308). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-20059-5_17
