A Study on Knowledge Distillation from Weak Teacher for Scaling Up Pre-trained Language Models

Abstract

Distillation from Weak Teacher (DWT) is a method of transferring knowledge from a smaller, weaker teacher model to a larger student model to improve the student's performance. Previous studies have shown that DWT can be effective in the vision domain and in the natural language processing (NLP) pre-training stage. In particular, DWT is promising in practical scenarios where a new-generation or larger model must be enhanced using an older or smaller pre-trained model under a limited resource budget. However, the optimal conditions for using DWT in NLP pre-training have yet to be fully investigated. Therefore, this study examines three key factors for optimizing DWT, which are distinct from those used in the vision domain or in traditional knowledge distillation: (i) the impact of teacher model quality on DWT effectiveness, (ii) guidelines for adjusting the weighting value for the DWT loss, and (iii) the impact of parameter remapping as a student model initialization technique for DWT.
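For context, a DWT-style objective is typically a weighted combination of the student's own pre-training loss (e.g., masked language modeling) and a distillation loss against the weak teacher's outputs; the weighting value in factor (ii) controls this trade-off. The PyTorch sketch below is a minimal illustration of that idea, not the paper's exact formulation; the weight `alpha`, the temperature, and the function name are assumptions.

```python
# Minimal sketch of a DWT-style objective (assumed formulation, not the paper's exact loss):
# total loss = (1 - alpha) * MLM loss + alpha * KL(student || teacher) on the vocabulary logits.
import torch
import torch.nn.functional as F

def dwt_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=2.0):
    """Weighted combination of the student's MLM loss and a distillation loss from a
    (smaller, weaker) teacher. `alpha` plays the role of the DWT loss weight studied
    as factor (ii); larger alpha leans more heavily on the weak teacher's soft labels."""
    vocab_size = student_logits.size(-1)

    # Standard masked-language-modeling loss; ignore_index=-100 masks non-predicted tokens.
    mlm = F.cross_entropy(
        student_logits.view(-1, vocab_size),
        labels.view(-1),
        ignore_index=-100,
    )

    # Soft-label distillation: KL divergence between temperature-scaled distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    kd = F.kl_div(
        F.log_softmax(student_logits.view(-1, vocab_size) / temperature, dim=-1),
        F.softmax(teacher_logits.view(-1, vocab_size) / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    return (1.0 - alpha) * mlm + alpha * kd
```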

Citation (APA)

Lee, H., Hou, R., Kim, J., Liang, D., Hwang, S. J., & Min, A. (2023). A Study on Knowledge Distillation from Weak Teacher for Scaling Up Pre-trained Language Models. In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 11239–11246). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.findings-acl.714
