SKDBERT: Compressing BERT via Stochastic Knowledge Distillation

Abstract

In this paper, we propose Stochastic Knowledge Distillation (SKD) to obtain a compact BERT-style language model dubbed SKDBERT. In each iteration, SKD samples a teacher from a pre-defined teacher ensemble, which consists of multiple teachers with multi-level capacities, to transfer knowledge to the student in a one-to-one manner. The sampling distribution plays an important role in SKD. We heuristically present three types of sampling distributions to assign appropriate probabilities to multi-level teachers. SKD has two advantages: 1) it preserves the diversity of multi-level teachers by stochastically sampling a single teacher in each iteration, and 2) it improves the efficacy of knowledge distillation via multi-level teachers when a large capacity gap exists between the teacher and the student. Experimental results on the GLUE benchmark show that SKDBERT reduces the size of a BERT model by 40% while retaining 99.5% of its language understanding performance and being 100% faster.
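The following is a minimal sketch of the per-iteration teacher sampling described in the abstract, written in PyTorch. The function and variable names, the uniform batch shapes, the softmax temperature, and the specific sampling probabilities are illustrative assumptions; the paper's actual loss formulation and sampling distributions may differ.

```python
# Sketch of one Stochastic Knowledge Distillation (SKD) iteration:
# sample a single teacher from a multi-level ensemble, then distill
# its soft labels into the student. Details are assumptions.
import torch
import torch.nn.functional as F


def skd_step(student, teachers, probs, batch, optimizer, temperature=2.0):
    """One SKD iteration: sample one teacher, then minimize a KD loss against it."""
    # Sample a teacher index according to the pre-defined sampling distribution.
    idx = torch.multinomial(probs, num_samples=1).item()
    teacher = teachers[idx]

    with torch.no_grad():
        teacher_logits = teacher(batch)   # frozen teacher forward pass
    student_logits = student(batch)       # student forward pass

    # Standard KD objective: KL divergence between temperature-softened distributions.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), idx


if __name__ == "__main__":
    # Tiny stand-ins for multi-level teachers and a student (hypothetical sizes).
    teachers = [torch.nn.Linear(16, 4) for _ in range(3)]
    student = torch.nn.Linear(16, 4)
    probs = torch.tensor([0.2, 0.3, 0.5])  # one illustrative sampling distribution
    optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
    batch = torch.randn(8, 16)
    loss, idx = skd_step(student, teachers, probs, batch, optimizer)
    print(f"sampled teacher {idx}, kd loss {loss:.4f}")
```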

Cite (APA)

Ding, Z., Jiang, G., Zhang, S., Guo, L., & Lin, W. (2023). SKDBERT: Compressing BERT via Stochastic Knowledge Distillation. In Proceedings of the 37th AAAI Conference on Artificial Intelligence, AAAI 2023 (Vol. 37, pp. 7414–7422). AAAI Press. https://doi.org/10.1609/aaai.v37i6.25902
