Abstract
In this paper, we propose a new knowledge distillation approach called adaptive contrastive knowledge distillation (ACKD) for BERT compression. Unlike existing knowledge distillation methods for BERT, which learn discriminative student features only implicitly by mimicking the teacher features, we first introduce a novel contrastive distillation loss (CDL) based on hidden state features in BERT as explicit supervision for learning discriminative student features. We further observe that sentences with similar features may have completely different meanings, which makes them hard to distinguish. Existing methods do not pay sufficient attention to these hard samples with less discriminative features. Therefore, we propose a new strategy called sample adaptive reweighting (SAR) that adaptively pays more attention to these hard samples and makes their features more discriminative. We incorporate our SAR strategy into our CDL to form the adaptive contrastive distillation loss, on which we build our ACKD framework. Comprehensive experiments on multiple natural language processing tasks demonstrate the effectiveness of our ACKD framework.
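The abstract does not give the exact form of the CDL or of SAR, so the following is only a minimal sketch of the general idea, not the authors' formulation. It assumes an InfoNCE-style contrastive loss over pooled hidden state features, where each student feature should match the teacher feature of the same sentence and differ from the teacher features of the other sentences in the batch. The reweighting rule (weight proportional to one minus the probability assigned to the correct pair) is a hypothetical stand-in for SAR, and the names `contrastive_distill_loss`, `temperature`, and `reweight` are illustrative.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def contrastive_distill_loss(student, teacher, temperature=0.1, reweight=True):
    """InfoNCE-style distillation loss with optional hard-sample reweighting.

    student, teacher: arrays of shape (batch, dim), one pooled hidden state
    feature per sentence (assumed to share the same dimension).
    ASSUMPTION: this is an illustrative loss, not the paper's CDL or SAR.
    """
    s = l2_normalize(student)
    t = l2_normalize(teacher)

    # Similarity of every student feature to every teacher feature.
    logits = s @ t.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability

    # Row-wise log-softmax; the diagonal holds the matching (positive) pairs.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_sample = -np.diag(log_prob)

    if not reweight:
        return float(per_sample.mean())

    # Hypothetical reweighting: a sample whose positive pair gets low
    # probability is "hard" and receives a larger weight.
    p_pos = np.exp(-per_sample)
    weights = 1.0 - p_pos
    weights = weights / (weights.sum() + 1e-12)
    return float((weights * per_sample).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher_feats = rng.normal(size=(8, 16))
    student_feats = teacher_feats + 0.01 * rng.normal(size=(8, 16))
    print(contrastive_distill_loss(student_feats, teacher_feats))
```

In this sketch the weights sum to one, so the reweighted loss is a weighted average that leans toward the hardest samples in the batch. In a real setup the student features would usually pass through a learned projection to match the teacher dimension, which is omitted here.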
Cite
Guo, J., Liu, J., Wang, Z., Ma, Y., Gong, R., Xu, K., & Liu, X. (2023). Adaptive Contrastive Knowledge Distillation for BERT Compression. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 8941–8953). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.findings-acl.569