Multiple Teacher Distillation for Robust and Greener Models


Abstract

Language models are at the center of progress in natural language processing today. These models are mostly of significant size, and while there have been successful attempts to reduce them, at least some of those attempts rely on randomness. We propose a novel distillation procedure that leverages multiple teachers, which alleviates the dependency on the random seed and makes the resulting models more robust. We show that applying this procedure to the TinyBERT and DistilBERT models improves their worst-case results by up to 2% while keeping almost the same best-case ones. The latter holds even under a constraint on computational time, which is important for reducing the carbon footprint. In addition, we present results of applying the proposed procedure to a computer vision model, ResNet, showing that the statement also holds in this entirely different domain.
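The abstract does not give the exact loss the authors use, but the general idea of multi-teacher distillation can be sketched as follows: instead of matching a single teacher's softened output (whose quality can vary with the random seed), the student matches the *average* of several teachers' soft targets. This is a minimal illustrative sketch, not the paper's actual formulation; the temperature value, the averaging scheme, and the function names are assumptions.

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax over a list of logits (numerically stable).
    m = max(l / T for l in logits)
    exps = [math.exp(l / T - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def multi_teacher_kd_loss(student_logits, teacher_logits_list, T=2.0):
    """KL(avg_teacher || student) at temperature T (illustrative only).

    Averaging the soft targets of several teachers reduces the variance
    that a single, randomly seeded teacher would introduce.
    """
    # Average the teachers' softened distributions class by class.
    teacher_probs = [softmax(t, T) for t in teacher_logits_list]
    n = len(teacher_probs)
    avg = [sum(p[i] for p in teacher_probs) / n
           for i in range(len(student_logits))]
    student = softmax(student_logits, T)
    # KL divergence, scaled by T^2 as in standard distillation practice.
    return T * T * sum(p * math.log(p / q)
                       for p, q in zip(avg, student) if p > 0)
```

With a single teacher whose logits the student already matches, the loss is zero; with several disagreeing teachers, the student is pulled toward their consensus rather than any one seed-dependent model.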

Citation (APA)

Ilichev, A., Sorokin, N., Malykh, V., & Piontkovskaya, I. (2021). Multiple Teacher Distillation for Robust and Greener Models. In International Conference Recent Advances in Natural Language Processing, RANLP (pp. 601–610). Incoma Ltd. https://doi.org/10.26615/978-954-452-072-4_068
