Abstract
We present our submission to the BabyLM challenge, whose goal was to improve the sample efficiency of language models. We trained an ensemble consisting of a GPT-2 model and small LLaMA models on the developmentally plausible, 10M-word BabyLM dataset, then distilled it into a small, 58M-parameter LLaMA model, which outperforms both of its teachers as well as a similar model trained without distillation. This suggests that distillation can not only retain the full performance of a teacher model trained on a sufficiently small dataset; it can exceed it, and lead to significantly better performance than direct training.
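The abstract describes distilling an ensemble of teachers into a single small student. As a minimal sketch of what such an objective might look like, the snippet below blends hard-label cross-entropy with a KL term against the averaged, temperature-softened teacher distribution. The specific weighting (alpha), temperature, and ensemble averaging scheme are illustrative assumptions, not the exact loss used in the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits_list, labels,
                      temperature=2.0, alpha=0.5):
    """Hypothetical ensemble-distillation objective:
    alpha * cross-entropy(hard labels) + (1 - alpha) * KL(student || mean teacher)."""
    vocab = student_logits.size(-1)

    # Hard-label term: standard next-token cross-entropy.
    ce = F.cross_entropy(student_logits.view(-1, vocab), labels.view(-1))

    # Soft-label term: average the teachers' softened distributions,
    # then take the KL divergence to the student's softened distribution.
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    kl = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                  teacher_probs, reduction="batchmean") * temperature ** 2

    return alpha * ce + (1.0 - alpha) * kl

# Toy usage with random tensors standing in for model outputs
# (two teachers, e.g. a GPT-2 and a small LLaMA).
vocab, batch, seq = 32000, 2, 8
student = torch.randn(batch, seq, vocab)
teachers = [torch.randn(batch, seq, vocab) for _ in range(2)]
labels = torch.randint(0, vocab, (batch, seq))
print(distillation_loss(student, teachers, labels))
```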
Timiryasov, I., & Tastet, J. L. (2023). Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty. In CoNLL 2023 - BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, Proceedings (pp. 279–289). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.conll-babylm.24