Recently, large language models (LLMs) have shown an extraordinary ability to understand natural language and generate programming code. It has become common practice for software engineers to consult LLMs when encountering coding questions. Although efforts have been made to avoid syntax errors and align the generated code with the intended semantics, the reliability and robustness of code generated by LLMs have not yet been thoroughly studied. Executable code is not equivalent to reliable and robust code, especially in the context of real-world software development. For example, the misuse of APIs in generated code could lead to severe problems, such as resource leaks and program crashes. Existing code evaluation benchmarks and datasets focus on small, crafted tasks such as programming questions from coding interviews, which deviate from the problems developers typically consult LLMs about. To fill this gap, we propose ROBUSTAPI, a dataset for evaluating the reliability and robustness of code generated by LLMs. We collect 1208 coding questions from Stack Overflow covering 18 representative Java APIs. We summarize the common misuse patterns of these APIs and evaluate current popular LLMs against them. The evaluation results show that even for GPT-4, 62% of the generated code contains API misuses, which would cause unexpected consequences if the code were introduced into real-world software.
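As a minimal sketch of the kind of Java API misuse the abstract alludes to (this example is illustrative and not taken from the ROBUSTAPI dataset; the class and method names are hypothetical): a stream opened without a try-with-resources block leaks its file handle whenever an exception skips the manual `close()` call, while the try-with-resources form releases the resource on every path.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class ApiMisuseDemo {
    // Misuse pattern: if read() throws, close() is never reached,
    // so the underlying file handle leaks.
    static int firstByteLeaky(Path p) throws IOException {
        InputStream in = Files.newInputStream(p);
        int b = in.read(); // an exception here skips the close() below
        in.close();
        return b;
    }

    // Robust pattern: try-with-resources closes the stream
    // on both normal and exceptional exits.
    static int firstByteSafe(Path p) throws IOException {
        try (InputStream in = Files.newInputStream(p)) {
            return in.read();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".bin");
        Files.write(tmp, new byte[]{42});
        System.out.println(firstByteSafe(tmp)); // prints 42
        Files.deleteIfExists(tmp);
    }
}
```

Both methods return the same value on the happy path; the difference only surfaces under failure, which is precisely why such misuses survive superficial "does it run" testing.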
Zhong, L., & Wang, Z. (2024). Can LLM Replace Stack Overflow? A Study on Robustness and Reliability of Large Language Model Code Generation. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 38, pp. 21841–21849). Association for the Advancement of Artificial Intelligence. https://doi.org/10.1609/aaai.v38i19.30185