GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model


Abstract

Currently, the reduction in the parameter scale of large-scale pre-trained language models (PLMs) through knowledge distillation has greatly facilitated their widespread deployment on various devices. However, deploying knowledge distillation systems in real-world industrial-strength applications poses great challenges: such applications require applying complex distillation methods to even larger-scale PLMs (over 10B parameters), while being constrained by GPU memory and the difficulty of switching between methods. To overcome these challenges, we propose GKD, a general knowledge distillation framework that supports distillation on larger-scale PLMs using various distillation methods. With GKD, developers can build larger distillation models on memory-limited GPUs and easily switch between and combine different distillation methods within a single framework. Experimental results show that GKD can support the distillation of at least 100B-scale PLMs and 25 mainstream distillation methods on 8 NVIDIA A100 (40GB) GPUs.
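For readers unfamiliar with the underlying technique, the sketch below shows a minimal response-based knowledge distillation objective in PyTorch: a temperature-scaled KL divergence between student and teacher logits combined with the usual hard-label cross-entropy. This is only an illustrative sketch of standard distillation, not the GKD framework's API; the function name `distillation_loss` and the `temperature` and `alpha` hyperparameters are assumptions chosen for the example.

```python
# Minimal sketch of vanilla response-based knowledge distillation
# (soft-label KL loss plus hard-label cross-entropy).
# Illustrative only; not the GKD framework's API.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Combine a soft-label KL term (teacher vs. student) with hard-label cross-entropy."""
    # Soften both distributions with the temperature, then compute KL divergence.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    # Standard supervised loss on the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce


if __name__ == "__main__":
    # Toy usage: batch of 4 examples over a 10-class output.
    student_logits = torch.randn(4, 10, requires_grad=True)
    teacher_logits = torch.randn(4, 10)  # produced by a frozen teacher in practice
    labels = torch.randint(0, 10, (4,))
    loss = distillation_loss(student_logits, teacher_logits, labels)
    loss.backward()
    print(loss.item())
```

The paper's contribution is at the systems level: supporting such objectives (and some 25 other distillation methods) on 100B-scale teachers within the memory budget of 8 A100 (40GB) GPUs, and letting developers switch or combine methods in one framework.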

Citation (APA)

Tan, S., Tam, W. L., Wang, Y., Gong, W., Zhao, S., Zhang, P., & Tang, J. (2023). GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 5, pp. 134–148). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.acl-industry.15
