Currently, reducing the parameter scale of large-scale pre-trained language models (PLMs) through knowledge distillation has greatly facilitated their widespread deployment on various devices. However, deploying knowledge distillation systems in real-world, industrial-strength applications remains challenging: such applications require applying complex distillation methods to even larger-scale PLMs (over 10B parameters), while being constrained by GPU memory and the difficulty of switching between methods. To overcome these challenges, we propose GKD, a general knowledge distillation framework that supports distilling larger-scale PLMs with a variety of distillation methods. With GKD, developers can build larger distillation models on memory-limited GPUs and easily switch between and combine different distillation methods within a single framework. Experimental results show that GKD can support the distillation of at least 100B-scale PLMs and 25 mainstream distillation methods on 8 NVIDIA A100 (40GB) GPUs.
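For context, knowledge distillation typically trains a smaller student model to match a larger teacher's output distribution in addition to the ground-truth labels. The sketch below illustrates this generic objective only; it is not GKD's actual API, and the function and argument names are placeholders chosen for illustration.

```python
# Minimal sketch of a standard knowledge-distillation objective:
# soft-label KL divergence against the teacher plus hard-label cross-entropy.
# Generic, illustrative code; not taken from the GKD framework.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soften both distributions with a temperature, then match the student
    # to the teacher via KL divergence (scaled by T^2, as is conventional).
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd_loss = F.kl_div(soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2

    # Ordinary cross-entropy against the ground-truth labels.
    ce_loss = F.cross_entropy(student_logits, labels)

    # Weighted combination of the soft- and hard-label terms.
    return alpha * kd_loss + (1.0 - alpha) * ce_loss
```

A framework like the one described would let developers swap this soft-label objective for other distillation signals (e.g., intermediate-layer or attention matching) without rewriting the training loop, though the exact mechanism is not specified in this abstract.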
Tan, S., Tam, W. L., Wang, Y., Gong, W., Zhao, S., Zhang, P., & Tang, J. (2023). GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 5, pp. 134–148). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.acl-industry.15