Abstract
Recently, Vision Transformer (ViT) and its variants have demonstrated promising performance on various computer vision tasks. Nevertheless, task-irrelevant information, such as background nuisance and noise in patch tokens, can degrade the performance of ViT-based models. In this paper, we develop the Sufficient Vision Transformer (Suf-ViT) as a new solution to this issue. We propose Sufficiency-Blocks (S-Blocks), applied across the depth of Suf-ViT, to accurately disentangle and discard task-irrelevant information. In addition, to boost the training of Suf-ViT, we formulate a Sufficient-Reduction Loss (SRLoss) leveraging the concept of Mutual Information (MI), which enables Suf-ViT to extract more reliable sufficient representations by removing task-irrelevant information. Extensive experiments on benchmark datasets such as ImageNet, ImageNet-C, and CIFAR-10 indicate that our method achieves state-of-the-art or competitive performance relative to baseline methods. Code is available at: https://github.com/zhicheng2T0/Sufficient-Vision-Transformer.git
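The abstract does not specify the exact form of SRLoss, but a loss that pairs a task term with an MI-based compression penalty can be sketched in the spirit described: cross-entropy keeps the representation sufficient for the task, while a variational KL term (a standard upper-bound proxy for I(X; Z), as in information-bottleneck methods) pressures the model to discard task-irrelevant information. All names and the exact penalty below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F


def sr_loss_sketch(logits, targets, z_mu, z_logvar, beta=1e-3):
    """Hypothetical sketch of a sufficiency/reduction-style objective.

    logits:   (B, num_classes) task predictions from the representation
    targets:  (B,) integer class labels
    z_mu, z_logvar: (B, D) parameters of a Gaussian posterior over the
                    representation z (assumed here for illustration)
    beta:     weight trading off task sufficiency vs. compression
    """
    # Sufficiency term: representation must still solve the task.
    ce = F.cross_entropy(logits, targets)

    # Reduction term: KL( N(mu, sigma^2) || N(0, I) ), averaged over the
    # batch -- a variational upper bound commonly used as a proxy for
    # the mutual information between input and representation.
    kl = 0.5 * (z_mu.pow(2) + z_logvar.exp() - 1.0 - z_logvar).sum(dim=1).mean()

    return ce + beta * kl
```

With `beta=0` the objective reduces to plain cross-entropy; increasing `beta` trades task fit for a more compressed, nuisance-free representation.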
Citation
Cheng, Z., Su, X., Wang, X., You, S., & Xu, C. (2022). Sufficient Vision Transformer. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 190–200). Association for Computing Machinery. https://doi.org/10.1145/3534678.3539322