Adaptive Contrastive Knowledge Distillation for BERT Compression

Abstract

In this paper, we propose a new knowledge distillation approach called adaptive contrastive knowledge distillation (ACKD) for BERT compression. Unlike existing knowledge distillation methods for BERT, which learn discriminative student features only implicitly by mimicking teacher features, we first introduce a novel contrastive distillation loss (CDL) based on hidden-state features in BERT as explicit supervision for learning discriminative student features. We further observe that sentences with similar features may have completely different meanings, which makes them hard to distinguish. Existing methods do not pay sufficient attention to these hard samples with less discriminative features. Therefore, we propose a new strategy called sample adaptive reweighting (SAR) that adaptively pays more attention to these hard samples and strengthens their discriminability. We incorporate the SAR strategy into the CDL to form the adaptive contrastive distillation loss, on which our ACKD framework is built. Comprehensive experiments on multiple natural language processing tasks demonstrate the effectiveness of the ACKD framework.
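To make the two components concrete, the following is a minimal, hypothetical PyTorch sketch of a contrastive distillation loss combined with sample-adaptive reweighting. It is not the authors' exact formulation: the projection of hidden states, the choice of negatives, the temperature, and the weighting function are all illustrative assumptions, and the function name is invented for this sketch.

```python
import torch
import torch.nn.functional as F

def adaptive_contrastive_distillation_loss(student_h, teacher_h, temperature=0.1):
    """Illustrative sketch (not the paper's exact loss).

    student_h, teacher_h: [batch, dim] pooled hidden-state features from the
    student and teacher models for the same batch of sentences.
    """
    # Normalize so dot products become cosine similarities.
    s = F.normalize(student_h, dim=-1)
    t = F.normalize(teacher_h, dim=-1)

    # Similarity of each student feature to every teacher feature in the batch.
    logits = s @ t.T / temperature                      # [batch, batch]
    targets = torch.arange(s.size(0), device=s.device)

    # InfoNCE-style contrastive term: the matching teacher feature is the
    # positive; other teacher features in the batch serve as negatives.
    per_sample_loss = F.cross_entropy(logits, targets, reduction="none")

    # Sample-adaptive reweighting (assumed form): samples whose positive
    # similarity barely exceeds the hardest negative are treated as hard
    # and receive larger weights.
    with torch.no_grad():
        pos_sim = logits.diagonal()
        eye = torch.eye(s.size(0), dtype=torch.bool, device=s.device)
        hardest_neg = logits.masked_fill(eye, float("-inf")).max(dim=1).values
        margin = pos_sim - hardest_neg
        weights = torch.softmax(-margin, dim=0) * s.size(0)  # mean weight ~ 1

    return (weights * per_sample_loss).mean()
```

In practice such a term would be added to the usual distillation objectives (e.g., soft-label and task losses); the weighting scheme above is only one plausible way to emphasize hard, less discriminative samples as the abstract describes.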

Citation (APA)

Guo, J., Liu, J., Wang, Z., Ma, Y., Gong, R., Xu, K., & Liu, X. (2023). Adaptive contrastive knowledge distillation for BERT compression. In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 8941–8953). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-acl.569
