LMentry: A Language Model Benchmark of Elementary Language Tasks

Abstract

As the performance of large language models rapidly improves, benchmarks are getting larger and more complex as well. We present LMentry, a benchmark that avoids this “arms race” by focusing on a compact set of tasks that are trivial to humans, e.g. writing a sentence containing a specific word, identifying which words in a list belong to a specific category, or choosing which of two words is longer. LMentry is specifically designed to provide quick and interpretable insights into the capabilities and robustness of large language models. Our experiments reveal a wide variety of failure cases that, while immediately obvious to humans, pose a considerable challenge for large language models, including OpenAI's latest 175B-parameter instruction-tuned model, TextDavinci002. LMentry complements contemporary evaluation approaches of large language models, providing a quick, automatic, and easy-to-run “unit test”, without resorting to large benchmark suites of complex tasks.
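
To make the "unit test" framing concrete, here is a minimal Python sketch of how two of the tasks named above (writing a sentence that contains a given word, and choosing the longer of two words) could be checked automatically. It is an illustration only and not LMentry's actual scoring code: the function names, the whole-word matching rule, and the handling of ties are all assumptions introduced here.

```python
import re


def contains_word(sentence: str, word: str) -> bool:
    """Return True if `word` occurs in `sentence` as a whole word (case-insensitive)."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, sentence, flags=re.IGNORECASE) is not None


def longer_word(first: str, second: str) -> str:
    """Return the longer of two words; equal lengths are rejected (an assumption)."""
    if len(first) == len(second):
        raise ValueError("words have equal length")
    return first if len(first) > len(second) else second


def check_longer_word_answer(answer: str, first: str, second: str) -> bool:
    """Return True if the answer names the longer word and does not name the shorter one."""
    expected = longer_word(first, second)
    other = second if expected == first else first
    return contains_word(answer, expected) and not contains_word(answer, other)
```

A model's response would be passed to `contains_word` or `check_longer_word_answer` together with the task's arguments. A real scorer would likely need more lenient matching than this sketch, for example to accept inflected forms or answers that quote both words.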

Cite

Efrat, A., Honovich, O., & Levy, O. (2023). LMentry: A Language Model Benchmark of Elementary Language Tasks. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 10476–10501). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.findings-acl.666
