How do we evaluate LLMs and determine the aspects and limits of their intelligent behaviour? When exposed to visual tests of analytic intelligence, human problem-solvers identify the rules applied to relevant objects and attributes. Based on the induced rules, they can generalise and provide a solution to the test. An analogous language task, called a Blackbird Language Matrix (BLM), has recently been proposed for LLMs. In this paper, we use this task to investigate what linguistic reasoning LLMs develop, by asking them to solve some simple variants of the BLM task. We find that current state-of-the-art generative models can handle the task: they easily understand the instructions and can provide step-by-step explanations. The explanations show that LLMs can overcome two of the main hurdles: correspondence finding (object and attribute identification) and item novelty. However, overall they struggle to find the correct underlying global rules, even when they produce the right answer. We argue that these findings support the usefulness of the task as a method to test the limits and specific properties of generalisation ability in Large Language Models, providing an intrinsic evaluation method inspired by tests of human intelligence.
Merlo, P. (2023). Blackbird language matrices (BLM), a new task for rule-like generalization in neural networks: Can Large Language Models pass the test? In Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 8119–8152). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.findings-emnlp.546