Comparative Analysis of AI Models for Python Code Generation: A HumanEval Benchmark Study

Ali Bayram; Gonca Gokce Menekse Dalveren; Mohammad Derawi

Journal Article, Open Access

Applied Sciences (Switzerland) (2025) 15(18)

DOI: 10.3390/app15189907

Abstract

This study presents a comparative analysis of six contemporary artificial intelligence models for Python code generation using the HumanEval benchmark. The evaluated models are GPT-3.5 Turbo, GPT-4 Omni, Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude Sonnet 4, and Claude Opus 4. A total of 164 Python programming problems were used to assess model performance with a multi-faceted methodology that combines automated functional correctness evaluation via the Pass@1 metric, cyclomatic complexity analysis, maintainability index calculations, and lines-of-code assessment. Claude Sonnet 4 achieved the highest success rate at 95.1%, followed closely by Claude Opus 4 at 94.5%. Across all metrics, Anthropic's Claude models consistently outperformed OpenAI's GPT models by margins exceeding 20%. Statistical analysis confirmed significant differences between the two model families (p < 0.001). The Claude models generated more sophisticated and maintainable solutions with higher syntactic accuracy, whereas the GPT models tended to adopt simpler strategies and were notably less reliable. These findings offer evidence-based guidance for selecting AI-powered coding assistants in professional software development.
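To make the Pass@1 metric named in the abstract concrete, the following is a minimal sketch of the standard unbiased pass@k estimator commonly used with HumanEval, reduced to the k = 1 case. It is not the authors' evaluation code. The function names `pass_at_k` and `benchmark_pass_at_1`, the assumption of one generated sample per problem, and the 156/8 split of passing and failing problems are all hypothetical and chosen only for illustration; the abstract does not report per-problem outcomes.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: number of samples generated for the problem
    c: number of those samples that passed all unit tests
    k: number of samples the metric is allowed to draw
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: list[bool]) -> float:
    """Mean Pass@1 over problems, assuming one sample per problem."""
    return sum(pass_at_k(1, int(ok), 1) for ok in results) / len(results)


# Hypothetical outcome vector for 164 problems (illustration only).
results = [True] * 156 + [False] * 8
score = benchmark_pass_at_1(results)
print(f"{score:.1%}")  # prints 95.1%
```

With a single sample per problem, Pass@1 reduces to the fraction of problems whose generated solution passes every test, so 156 of 164 passing problems rounds to 95.1%. Whether the study drew exactly one sample per problem is an assumption here.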

Citation (APA)

Bayram, A., Menekse Dalveren, G. G., & Derawi, M. (2025). Comparative Analysis of AI Models for Python Code Generation: A HumanEval Benchmark Study. Applied Sciences (Switzerland), 15(18). https://doi.org/10.3390/app15189907
