Recently, Transformers have shown promising performance on various vision tasks. To reduce the quadratic computational complexity caused by each query attending to all keys/values, various methods constrain the range of attention to local regions, where each query attends only to keys/values within a hand-crafted window. However, these hand-crafted window partition mechanisms are data-agnostic and ignore the input content, so a query may attend to irrelevant keys/values. To address this issue, we propose Dynamic Group Attention (DG-Attention), which dynamically divides all queries into multiple groups and selects the most relevant keys/values for each group. DG-Attention can flexibly model more relevant dependencies without the spatial constraints used in hand-crafted window-based attention. Built on DG-Attention, we develop a general vision transformer backbone named Dynamic Group Transformer (DGT). Extensive experiments show that our models outperform state-of-the-art methods on multiple common vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation.
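To make the idea concrete, below is a minimal sketch of grouped attention in the spirit of DG-Attention. It is not the authors' implementation: the grouping rule (nearest of a few query prototypes) and the per-group key selection (top-k keys scored against the group's mean query) are illustrative assumptions, and the names `dg_attention`, `num_groups`, and `topk` are hypothetical. The sketch only shows the two steps named in the abstract: dynamically grouping queries, then restricting each group's attention to its most relevant keys/values.

```python
import torch
import torch.nn.functional as F

def dg_attention(q, k, v, num_groups=4, topk=16):
    """Illustrative grouped attention (not the paper's exact formulation).

    q, k, v: (N, D) tensors of queries, keys, and values.
    Queries are assigned to groups, and each group attends only to its
    top-k most relevant keys/values instead of all N of them.
    """
    N, D = q.shape
    # Assumed grouping rule: assign each query to the nearest of
    # `num_groups` prototypes sampled from the queries themselves.
    prototypes = q[torch.randperm(N)[:num_groups]]            # (G, D)
    assign = (q @ prototypes.t()).argmax(dim=-1)              # (N,)

    out = torch.zeros_like(q)
    for g in range(num_groups):
        idx = (assign == g).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue
        qg = q[idx]                                           # (Ng, D)
        # Assumed relevance score: rate every key against the group's
        # mean query, then keep only the top-k keys/values.
        scores = k @ qg.mean(dim=0)                           # (N,)
        sel = scores.topk(min(topk, N)).indices               # (K,)
        # Standard scaled dot-product attention within the selected set.
        attn = F.softmax(qg @ k[sel].t() / D ** 0.5, dim=-1)  # (Ng, K)
        out[idx] = attn @ v[sel]
    return out

# Usage: out = dg_attention(torch.randn(64, 32),
#                           torch.randn(64, 32),
#                           torch.randn(64, 32))
```

Because each group attends to only K selected keys rather than all N, the per-group cost drops from O(Ng * N) to O(Ng * K), which is the same motivation as window attention but without a fixed spatial partition.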
Citation
Liu, K., Wu, T., Liu, C., & Guo, G. (2022). Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention. In IJCAI International Joint Conference on Artificial Intelligence (pp. 1187–1193). International Joint Conferences on Artificial Intelligence. https://doi.org/10.24963/ijcai.2022/166