Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention


Abstract

Recently, Transformers have shown promising performance on various vision tasks. To reduce the quadratic computational complexity caused by each query attending to all keys/values, various methods constrain the range of attention to local regions, where each query attends only to keys/values within a hand-crafted window. However, these hand-crafted window partition mechanisms are data-agnostic and ignore the input content, so a query may attend to irrelevant keys/values. To address this issue, we propose Dynamic Group Attention (DG-Attention), which dynamically divides all queries into multiple groups and selects the most relevant keys/values for each group. DG-Attention can flexibly model more relevant dependencies without the spatial constraints used in hand-crafted window-based attention. Built on DG-Attention, we develop a general vision transformer backbone named Dynamic Group Transformer (DGT). Extensive experiments show that our models outperform state-of-the-art methods on multiple common vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation.
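
The abstract describes the mechanism only at a high level. Below is a minimal PyTorch sketch of how such a dynamic group attention could work, assuming learned group prototypes for assigning queries to groups and per-group top-k key/value selection; the class name, the `group_proto` parameter, and the prototype-based grouping are illustrative assumptions, not the paper's exact method.

```python
import torch
import torch.nn as nn


class DynamicGroupAttention(nn.Module):
    """Sketch of dynamic group attention: queries are assigned to groups,
    and each group attends only to its top-k most relevant keys/values.
    The grouping and selection details here are assumptions; the paper's
    exact mechanism may differ."""

    def __init__(self, dim, num_groups=4, top_k=16):
        super().__init__()
        self.num_groups = num_groups
        self.top_k = top_k
        self.scale = dim ** -0.5
        self.to_qkv = nn.Linear(dim, dim * 3)
        # Hypothetical: learned group prototypes used to assign queries
        # to groups and to score keys for each group.
        self.group_proto = nn.Parameter(torch.randn(num_groups, dim))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, tokens, dim)
        B, N, C = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)

        # Assign each query to the group whose prototype it matches best.
        group_logits = q @ self.group_proto.t()          # (B, N, G)
        group_id = group_logits.argmax(dim=-1)           # (B, N)

        out = torch.zeros_like(q)
        for g in range(self.num_groups):
            # Relevance of every key to this group's prototype.
            key_scores = k @ self.group_proto[g]         # (B, N)
            topk = min(self.top_k, N)
            idx = key_scores.topk(topk, dim=-1).indices  # (B, topk)
            k_g = k.gather(1, idx.unsqueeze(-1).expand(-1, -1, C))
            v_g = v.gather(1, idx.unsqueeze(-1).expand(-1, -1, C))

            # Standard scaled dot-product attention over the selected
            # keys/values only, instead of all N tokens.
            attn = (q @ k_g.transpose(-2, -1)) * self.scale
            attn = attn.softmax(dim=-1)
            out_g = attn @ v_g                           # (B, N, C)

            # Keep outputs only for queries assigned to this group.
            mask = (group_id == g).unsqueeze(-1)
            out = torch.where(mask, out_g, out)

        return self.proj(out)


# Usage: 14x14 patch tokens with embedding dimension 64.
attn = DynamicGroupAttention(dim=64)
x = torch.randn(2, 196, 64)
y = attn(x)  # (2, 196, 64)
```

Because each query attends to at most top_k selected keys rather than all N tokens, the attention cost drops from O(N^2) toward O(N * top_k), while the data-dependent grouping avoids the fixed spatial windows of hand-crafted partition schemes.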

Citation (APA)
Liu, K., Wu, T., Liu, C., & Guo, G. (2022). Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention. In IJCAI International Joint Conference on Artificial Intelligence (pp. 1187–1193). International Joint Conferences on Artificial Intelligence. https://doi.org/10.24963/ijcai.2022/166
