Language models are nowadays at the center of progress in natural language processing. These models are typically large, and although there are successful attempts to compress them, at least some of these attempts rely on randomness. We propose a novel distillation procedure that leverages multiple teachers, alleviating random-seed dependency and making the models more robust. We show that applying this procedure to the TinyBERT and DistilBERT models improves their worst-case results by up to 2% while keeping the best-case results almost unchanged. The latter holds even under a constraint on computational time, which matters for reducing the carbon footprint. In addition, we present results of applying the proposed procedure to ResNet, a computer vision model, showing that the claim holds in this entirely different domain.
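The abstract does not spell out how the teachers' outputs are combined; a common baseline is to average the teachers' temperature-softened distributions and train the student against that target. The sketch below illustrates this assumed aggregation with NumPy; the function names (`softmax`, `multi_teacher_distill_loss`) are hypothetical and the paper's actual procedure may weight or combine teachers differently.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax over the last axis (numerically stable).
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_teacher_distill_loss(student_logits, teacher_logits_list, T=2.0):
    """Cross-entropy between the averaged teachers' soft targets and the
    student's temperature-softened predictions. Averaging is an assumed
    aggregation scheme, not necessarily the one used in the paper."""
    target = np.mean([softmax(t, T) for t in teacher_logits_list], axis=0)
    log_student = np.log(softmax(student_logits, T))
    # Mean over the batch of per-example cross-entropies.
    return float(-(target * log_student).sum(axis=-1).mean())
```

With two teachers that disagree symmetrically, the averaged target is uniform, so a uniform student prediction attains the minimum loss; a student that commits to either teacher's answer is penalized, which is the intuition behind the robustness claim.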
Ilichev, A., Sorokin, N., Malykh, V., & Piontkovskaya, I. (2021). Multiple Teacher Distillation for Robust and Greener Models. In International Conference Recent Advances in Natural Language Processing, RANLP (pp. 601–610). Incoma Ltd. https://doi.org/10.26615/978-954-452-072-4_068