Generating Better Items for Cognitive Assessments Using Large Language Models

Abstract

Writing high-quality test questions (items) is critical to building educational measures but has traditionally been a time-consuming process. One promising avenue for alleviating this is automated item generation, whereby methods from artificial intelligence (AI) are used to generate new items with minimal human intervention. Researchers have explored using large language models (LLMs) to generate new items with psychometric properties equivalent to those of human-written ones. But can LLMs generate items with improved psychometric properties, even when the existing items have poor validity evidence? We investigate this using items from a natural language inference (NLI) dataset. We develop a novel prompting strategy that selects items with both the best and the worst properties for inclusion in the prompt, and use GPT-3 to generate new NLI items. We find that the GPT-3 items show improved psychometric properties in many cases, while also possessing good content, convergent, and discriminant validity evidence. Collectively, our results demonstrate the potential of employing LLMs to ease the item development process and suggest that the careful use of prompting may allow for iterative improvement of item quality.

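The abstract describes a prompting strategy that contrasts items with the best and worst psychometric properties when asking GPT-3 for new NLI items. The sketch below shows one plausible way such a few-shot prompt could be assembled and sent to a GPT-3 model via the legacy OpenAI Completions API. It is illustrative only: the item fields, the use of an IRT discrimination estimate as the quality criterion, the prompt wording, and the model name (text-davinci-003) are assumptions, not the authors' exact setup.

```python
# Minimal sketch (not the paper's exact pipeline): contrast well- and poorly-
# functioning NLI items in a few-shot prompt, then ask a GPT-3 model for a new item.
# Assumes each item carries an estimated IRT discrimination parameter, one plausible
# psychometric quality criterion; the paper's criteria and wording may differ.
import openai  # legacy openai<1.0 Completions interface

openai.api_key = "YOUR_API_KEY"

items = [
    # Hypothetical NLI items with psychometric estimates; a real item pool
    # would contain many more entries.
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "A musician is performing.",
     "label": "entailment", "discrimination": 1.8},
    {"premise": "Two dogs run across a field.",
     "hypothesis": "The animals are outside.",
     "label": "entailment", "discrimination": 0.2},
]

def build_prompt(items, n_best=1, n_worst=1):
    """Select the items with the highest and lowest discrimination and format a prompt."""
    ranked = sorted(items, key=lambda x: x["discrimination"], reverse=True)
    best, worst = ranked[:n_best], ranked[-n_worst:]

    def fmt(item):
        return (f"Premise: {item['premise']}\n"
                f"Hypothesis: {item['hypothesis']}\n"
                f"Label: {item['label']}")

    return (
        "Here are examples of well-functioning NLI test items:\n\n"
        + "\n\n".join(fmt(i) for i in best)
        + "\n\nHere are examples of poorly functioning NLI test items:\n\n"
        + "\n\n".join(fmt(i) for i in worst)
        + "\n\nWrite a new NLI item that functions like the well-functioning "
          "examples, in the same Premise/Hypothesis/Label format:\n\n"
    )

response = openai.Completion.create(
    model="text-davinci-003",  # a GPT-3-family model; chosen here for illustration
    prompt=build_prompt(items),
    max_tokens=150,
    temperature=0.7,
)
print(response["choices"][0]["text"].strip())
```

The sketch covers a single generation step; as the abstract notes, careful prompting of this kind could in principle be applied iteratively, re-estimating item properties and regenerating to improve item quality.
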
Citation (APA)

Laverghetta, A., & Licato, J. (2023). Generating Better Items for Cognitive Assessments Using Large Language Models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 414–428). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.bea-1.34
