One Teacher is Enough? Pre-trained Language Model Distillation from Multiple Teachers

Chuhan Wu; Fangzhao Wu; Yongfeng Huang

Conference Proceedings (Open Access)

Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (2021) 4408-4413

DOI: 10.18653/v1/2021.findings-acl.387

27 Citations

92 Readers

Abstract

Pre-trained language models (PLMs) achieve great success in NLP. However, their huge model sizes hinder their application in many practical systems. Knowledge distillation is a popular technique for compressing PLMs, which learns a small student model from a large teacher PLM. However, the knowledge learned from a single teacher may be limited and even biased, resulting in a low-quality student model. In this paper, we propose a multi-teacher knowledge distillation framework named MT-BERT for pre-trained language model compression, which can train a high-quality student model from multiple teacher PLMs. In MT-BERT, we design a multi-teacher co-finetuning method to jointly finetune multiple teacher PLMs on downstream tasks with shared pooling and prediction layers, aligning their output spaces for better collaborative teaching. In addition, we propose a multi-teacher hidden loss and a multi-teacher distillation loss to transfer the useful knowledge in both hidden states and soft labels from multiple teacher PLMs to the student model. Experiments on three benchmark datasets validate the effectiveness of MT-BERT in compressing PLMs.
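
To make the two loss terms mentioned in the abstract more concrete, the following is a minimal sketch of what a soft-label distillation loss and a hidden-state loss over several teachers could look like. It is not the paper's exact formulation: the uniform averaging over teachers, the temperature parameter, the assumption that teacher and student hidden vectors share one dimension, and all function names are illustrative assumptions. The abstract does not specify how teachers are weighted or how hidden states are aligned.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(teacher_probs, student_probs):
    """Cross-entropy of the student distribution against teacher soft labels."""
    return -sum(p * math.log(q) for p, q in zip(teacher_probs, student_probs))


def multi_teacher_distillation_loss(teacher_logits_list, student_logits,
                                    temperature=1.0):
    """Soft-label loss averaged uniformly over teachers (an assumption)."""
    student_probs = softmax(student_logits, temperature)
    losses = [
        soft_cross_entropy(softmax(t, temperature), student_probs)
        for t in teacher_logits_list
    ]
    return sum(losses) / len(losses)


def multi_teacher_hidden_loss(teacher_hidden_list, student_hidden):
    """Mean squared error to each teacher's hidden vector, averaged over teachers.

    Assumes all vectors have the same length; a real system with mismatched
    sizes would need some learned projection, which is omitted here.
    """
    per_teacher = [
        sum((s - t) ** 2 for s, t in zip(student_hidden, teacher_hidden))
        / len(student_hidden)
        for teacher_hidden in teacher_hidden_list
    ]
    return sum(per_teacher) / len(per_teacher)


# Hypothetical logits from two teachers and one student on a 3-class task.
teachers = [[2.0, 0.5, -1.0], [1.5, 0.7, -0.8]]
student = [1.8, 0.6, -0.9]
loss = multi_teacher_distillation_loss(teachers, student, temperature=2.0)
```

In this sketch, a student whose logits track the teachers' logits receives a lower distillation loss than one that ranks the classes in the opposite order, and the hidden loss is zero only when the student's hidden vector matches every teacher's.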

Citation (APA)

Wu, C., Wu, F., & Huang, Y. (2021). One Teacher is Enough? Pre-trained Language Model Distillation from Multiple Teachers. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (pp. 4408–4413). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.findings-acl.387
