Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty


Abstract

We present our submission to the BabyLM challenge, whose goal was to improve the sample efficiency of language models. We trained an ensemble consisting of a GPT-2 model and small LLaMA models on the developmentally plausible, 10M-word BabyLM dataset, then distilled it into a small, 58M-parameter LLaMA model, which outperforms both of its teachers as well as a similar model trained without distillation. This suggests that, when the teachers are trained on a sufficiently small dataset, distillation can not only retain their full performance but exceed it, leading to significantly better results than direct training.
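To make the ensemble-distillation setup concrete, the sketch below shows one common way to combine a hard-label language-modeling loss with a soft loss against the averaged, temperature-softened distributions of several teachers. The loss weighting (alpha), temperature, and tensor shapes are illustrative assumptions, not the exact configuration reported in the paper.

```python
# Minimal sketch of distilling a student LM from an ensemble of teachers.
# alpha and temperature are placeholder hyperparameters, not the paper's values.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits_list, labels,
                      alpha=0.5, temperature=2.0):
    """Combine hard-label cross-entropy with a KL term against the
    averaged softened distribution of the teacher ensemble."""
    # Hard-label next-token prediction loss for the student.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    # Average the teachers' temperature-softened probability distributions.
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)

    # KL divergence between the student and the averaged teacher
    # distribution, scaled by T^2 as is conventional in distillation.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        teacher_probs,
        reduction="batchmean",
    ) * temperature ** 2

    return alpha * ce + (1.0 - alpha) * kl


# Usage with random logits: batch of 2 sequences, length 8, vocabulary 100.
vocab = 100
student_logits = torch.randn(2, 8, vocab)
teacher_logits = [torch.randn(2, 8, vocab), torch.randn(2, 8, vocab)]
labels = torch.randint(0, vocab, (2, 8))
loss = distillation_loss(student_logits, teacher_logits, labels)
```

In this formulation the two teachers (here standing in for the GPT-2 and LLaMA teachers) contribute equally to the soft target; other weightings or per-teacher losses are possible and the paper should be consulted for the exact objective used.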

Cite

APA

Timiryasov, I., & Tastet, J. L. (2023). Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty. In CoNLL 2023 - BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, Proceedings (pp. 279–289). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.conll-babylm.24
