Systematic testing of three Language Models reveals low language accuracy, absence of response stability, and a yes-response bias

16 citations · 38 Mendeley readers

Abstract

Humans are universally good at providing stable and accurate judgments about what forms part of their language and what does not. Large Language Models (LMs) are claimed to possess human-like language abilities; hence, they are expected to emulate this behavior by providing both stable and accurate answers when asked whether a string of words complies with or deviates from their next-word predictions. This work tests whether stability and accuracy are showcased by GPT-3/text-davinci-002, GPT-3/text-davinci-003, and ChatGPT, using a series of judgment tasks that tap into 8 linguistic phenomena: plural attraction, anaphora, center embedding, comparatives, intrusive resumption, negative polarity items, order of adjectives, and order of adverbs. For every phenomenon, 10 sentences (5 grammatical and 5 ungrammatical) are tested, each randomly repeated 10 times, totaling 800 elicited judgments per LM (total n = 2,400). Our results reveal variable above-chance accuracy in the grammatical condition, below-chance accuracy in the ungrammatical condition, significant instability of answers across phenomena, and a yes-response bias for all the tested LMs. Furthermore, we found no evidence that repetition helps the models converge on a processing strategy that culminates in stable answers, whether accurate or inaccurate. We demonstrate that the LMs' performance in identifying (un)grammatical word patterns stands in stark contrast to what is observed in humans (n = 80, tested on the same tasks) and argue that adopting LMs as theories of human language is not motivated at their current stage of development.
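As a rough illustration of the design arithmetic reported in the abstract, the Python sketch below (not the authors' code or materials; stimulus sentences are replaced by placeholder item indices) enumerates the trial cells of the design and reproduces the stated totals of 800 judgments per LM and 2,400 overall.

```python
import itertools
import random

# Illustrative sketch of the judgment-elicitation design described in the abstract:
# 8 phenomena x 2 conditions x 5 sentences x 10 repetitions = 800 trials per model.
PHENOMENA = [
    "plural attraction", "anaphora", "center embedding", "comparatives",
    "intrusive resumption", "negative polarity items",
    "order of adjectives", "order of adverbs",
]
MODELS = ["text-davinci-002", "text-davinci-003", "ChatGPT"]
N_SENTENCES_PER_CONDITION = 5   # per phenomenon, per (un)grammatical condition
N_REPETITIONS = 10

def build_trials(model: str) -> list[dict]:
    """Enumerate every (phenomenon, condition, item, repetition) cell for one model."""
    trials = []
    for phenomenon, condition in itertools.product(
        PHENOMENA, ("grammatical", "ungrammatical")
    ):
        for item in range(N_SENTENCES_PER_CONDITION):
            for rep in range(N_REPETITIONS):
                trials.append({
                    "model": model,
                    "phenomenon": phenomenon,
                    "condition": condition,
                    "item": item,        # placeholder index, not the paper's actual sentence
                    "repetition": rep,
                })
    random.shuffle(trials)  # repetitions are presented in random order
    return trials

per_model = len(build_trials(MODELS[0]))
print(per_model)                # 800 elicited judgments per LM
print(per_model * len(MODELS))  # 2,400 judgments in total
```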



Citation (APA)

Dentella, V., Günther, F., & Leivada, E. (2023). Systematic testing of three Language Models reveals low language accuracy, absence of response stability, and a yes-response bias. Proceedings of the National Academy of Sciences of the United States of America, 120(51). https://doi.org/10.1073/pnas.2309583120

Readers' Seniority

Professor / Associate Prof.: 9 (75%)
PhD / Postgrad / Masters / Doc: 2 (17%)
Researcher: 1 (8%)

Readers' Discipline

Linguistics: 4 (36%)
Psychology: 3 (27%)
Business, Management and Accounting: 3 (27%)
Social Sciences: 1 (9%)

Article Metrics

Mentions
Blog Mentions: 3
News Mentions: 1
