Can ChatGPT and Bard generate aligned assessment items? A reliability analysis against human performance


Abstract

ChatGPT and Bard are AI chatbots based on Large Language Models (LLMs) that are expected to support applications across diverse areas. In education, these AI technologies have been tested for applications in assessment and teaching. In assessment, AI has long been used in automated essay scoring and automated item generation. One psychometric property that these tools must have to assist or replace humans in assessment is high reliability, in terms of agreement between AI scores and human raters. In this paper, the reliability of OpenAI’s ChatGPT and Google’s Bard against experienced and trained human raters in perceiving and rating the complexity of writing prompts is measured. Using intraclass correlation (ICC) as a performance metric, the reliability of both ChatGPT and Bard was found to be low against the gold standard of human ratings.
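As an illustration of the reliability metric the abstract names, the sketch below computes a one-way random-effects ICC(1,1) from a subjects-by-raters matrix. The data and the specific ICC variant are hypothetical assumptions for demonstration; the paper's actual model and computation may differ.

```python
import numpy as np

def icc_1_1(ratings):
    """One-way random-effects ICC(1,1) for an (n subjects x k raters) matrix.

    Computed from the one-way ANOVA mean squares:
    ICC = (MS_between - MS_within) / (MS_between + (k - 1) * MS_within)
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand_mean = ratings.mean()
    row_means = ratings.mean(axis=1)
    # Between-subjects mean square
    ms_between = k * ((row_means - grand_mean) ** 2).sum() / (n - 1)
    # Within-subjects (residual) mean square
    ms_within = ((ratings - row_means[:, None]) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Toy example: 5 writing prompts, complexity rated 1-5 by a human and an AI tool
ratings = [[3, 2], [5, 3], [1, 4], [4, 2], [2, 5]]
print(f"ICC(1,1) = {icc_1_1(ratings):.3f}")
```

A low ICC on data like this would indicate poor agreement between the AI and human ratings, which is the pattern the study reports.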

Citation (APA)

Khademi, A. (2023). Can ChatGPT and Bard generate aligned assessment items? A reliability analysis against human performance. Journal of Applied Learning and Teaching, 6(1), 75–80. https://doi.org/10.37074/jalt.2023.6.1.28
