Mutation-based consistency testing for evaluating the code understanding capability of llms

9Citations
Citations of this article
41Readers
Mendeley users who have this article in their library.
Get full text

Abstract

Large Language Models (LLMs) have shown remarkable capabilities in processing both natural and programming languages, which have enabled various applications in software engineering, such as requirement engineering, code generation, and software testing. However, existing code generation benchmarks do not necessarily assess the code understanding performance of LLMs, especially for the subtle inconsistencies that may arise between code and its semantics described in natural language.In this paper, we propose a novel method, called Mutation-based Consistency Testing (MCT), to systematically assess the code understanding performance of LLMs, particularly focusing on subtle differences between code and its descriptions, by introducing code mutations to existing code generation datasets. Code mutations are small changes that alter the semantics of the original code, creating a mismatch with the natural language description. MCT uses different types of code mutations, such as operator replacement and statement deletion, to generate inconsistent code-description pairs. MCT then uses these pairs to test the ability of LLMs to detect the inconsistencies correctly.We conduct a case study on the two popular LLMs, GPT-3.5 and GPT-4, using the state-of-the-art code generation benchmark, HumanEval-X, which consists of 164 programming problems written in six programming languages (Python, C++, Java, Go, JavaScript, and Rust). The results show that the LLMs have significant variations in their code understanding performance and that they have different strengths and weaknesses depending on the mutation type and language. We further explain conditions under which the LLMs result in correct answers using input characteristics (e.g., number of tokens) and investigate to what extent the test results can be improved using one-shot prompts (i.e., providing an additional example). Our MCT method and the case study results provide valuable implications for future research and development of LLM-based software engineering.

Cite

CITATION STYLE

APA

Li, Z., & Shin, D. (2024). Mutation-based consistency testing for evaluating the code understanding capability of llms. In Proceedings - 2024 IEEE/ACM 3rd International Conference on AI Engineering - Software Engineering for AI, CAIN 2024 (pp. 150–159). Association for Computing Machinery, Inc. https://doi.org/10.1145/3644815.3644946

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free