Cost-effective Distillation of Large Language Models

Citations: 8 · Mendeley readers: 17

Abstract

Knowledge distillation (KD) involves training a small “student” model to replicate the strong performance of a high-capacity “teacher” model, enabling efficient deployment in resource-constrained settings. Top-performing methods tend to be task- or architecture-specific and lack generalizability. Several existing approaches require pretraining the teacher on task-specific datasets, which can be costly for large datasets and unstable for small ones. Here we propose an approach for improving KD through a novel distillation loss that is agnostic to the task and model architecture. We successfully apply our method to the distillation of BERT-base and achieve highly competitive results from the distilled student across a range of GLUE tasks, especially for tasks with smaller datasets.
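For readers unfamiliar with KD, the sketch below shows the standard distillation objective (softened cross-entropy between teacher and student logits combined with the supervised loss). It is a generic illustration only, not the task- and architecture-agnostic loss proposed in this paper, whose formulation is not given in the abstract; the temperature T and mixing weight alpha are illustrative hyperparameters.

```python
# Minimal sketch of the standard knowledge-distillation loss (Hinton-style),
# NOT the novel loss proposed in the paper. T and alpha are assumed values.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soften both distributions with temperature T so the teacher's
    # probability mass over non-target classes is visible to the student.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between teacher and student, scaled by T^2 so gradient
    # magnitudes stay comparable across temperatures.
    distill = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T ** 2)
    # Ordinary cross-entropy against the gold labels.
    supervised = F.cross_entropy(student_logits, labels)
    return alpha * distill + (1.0 - alpha) * supervised
```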

Citation (APA)

Dasgupta, S., Cohn, T., & Baldwin, T. (2023). Cost-effective Distillation of Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 7346–7354). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-acl.463
