GAML-BERT: Improving BERT Early Exiting by Gradient Aligned Mutual Learning


Abstract

In this work, we propose a novel framework, Gradient Aligned Mutual Learning BERT (GAML-BERT), for improving the early exiting of BERT. GAML-BERT's contributions are two-fold. First, we conduct a set of pilot experiments, which show that mutual knowledge distillation between a shallow exit and a deep exit leads to better performance for both. Based on this observation, we use mutual learning to improve BERT's early exiting performance; that is, we ask each exit of a multi-exit BERT to distill knowledge from the others. Second, we propose GA, a novel training method that aligns the gradients from the knowledge distillation loss with those from the cross-entropy loss. Extensive experiments are conducted on the GLUE benchmark, which show that GAML-BERT significantly outperforms state-of-the-art (SOTA) BERT early exiting methods.
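The abstract only describes the approach at a high level, so the following PyTorch sketch is a rough illustration of what mutual distillation between two exits combined with a gradient-alignment step could look like. The symmetric KL loss, the projection-based rule in `aligned_grad`, and the toy linear classifiers are assumptions made for illustration; they are not the paper's actual GA procedure or architecture.

```python
import torch
import torch.nn.functional as F

def symmetric_kd(logits_a, logits_b, T=1.0):
    """Mutual distillation: each exit matches the other's (detached) soft labels."""
    kd_a = F.kl_div(F.log_softmax(logits_a / T, dim=-1),
                    F.softmax(logits_b.detach() / T, dim=-1),
                    reduction="batchmean") * (T * T)
    kd_b = F.kl_div(F.log_softmax(logits_b / T, dim=-1),
                    F.softmax(logits_a.detach() / T, dim=-1),
                    reduction="batchmean") * (T * T)
    return kd_a + kd_b

def aligned_grad(g_kd, g_ce):
    """If the KD gradient conflicts with the CE gradient (negative dot product),
    drop the conflicting component. Projection is an illustrative assumption,
    not necessarily the paper's exact alignment rule."""
    dot = torch.dot(g_kd.flatten(), g_ce.flatten())
    if dot < 0:
        g_kd = g_kd - (dot / (g_ce.norm() ** 2 + 1e-12)) * g_ce
    return g_kd

# Toy setup: shared features feed a "shallow" and a "deep" exit classifier.
torch.manual_seed(0)
features = torch.randn(8, 16)                  # stand-in for BERT hidden states
labels = torch.randint(0, 3, (8,))
shallow = torch.nn.Linear(16, 3)
deep = torch.nn.Linear(16, 3)
params = list(shallow.parameters()) + list(deep.parameters())

logits_s, logits_d = shallow(features), deep(features)
ce = F.cross_entropy(logits_s, labels) + F.cross_entropy(logits_d, labels)
kd = symmetric_kd(logits_s, logits_d)

# Separate gradients for the two losses, then combine CE with aligned KD.
g_ce = torch.autograd.grad(ce, params, retain_graph=True)
g_kd = torch.autograd.grad(kd, params, retain_graph=True)
for p, gc, gk in zip(params, g_ce, g_kd):
    p.grad = gc + aligned_grad(gk, gc)
# An optimizer step (e.g., torch.optim.AdamW(params).step()) would follow.
```

In a real multi-exit BERT, the same pattern would be applied across all exits rather than just two, with the cross-entropy gradients treated as the reference direction that the distillation gradients are aligned to.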

Cite

APA: Zhu, W., Wang, X., Ni, Y., Xie, G., Guo, Z., & Wu, X. (2021). GAML-BERT: Improving BERT Early Exiting by Gradient Aligned Mutual Learning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021) (pp. 3033-3044). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.emnlp-main.242
