Abstract
This study conducts a comprehensive comparative analysis of six contemporary artificial intelligence models for Python code generation using the HumanEval benchmark. The evaluated models include GPT-3.5 Turbo, GPT-4 Omni, Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude Sonnet 4, and Claude Opus 4. A total of 164 Python programming problems were utilized to assess model performance through a multi-faceted methodology incorporating automated functional correctness evaluation via the Pass@1 metric, cyclomatic complexity analysis, maintainability index calculations, and lines-of-code assessment. The results indicate that Claude Sonnet 4 achieved the highest performance with a success rate of 95.1%, followed closely by Claude Opus 4 at 94.5%. Across all metrics, models developed by Anthropic Claude consistently outperformed those developed by OpenAI GPT by margins exceeding 20%. Statistical analysis further confirmed the existence of significant differences between the model families (p < 0.001). Anthropic Claude models were observed to generate more sophisticated and maintainable solutions with superior syntactic accuracy. In contrast, OpenAI GPT models tended to adopt simpler strategies but exhibited notable limitations in terms of reliability. These findings offer evidence-based insights to guide the selection of AI-powered coding assistants in professional software development contexts.
Author supplied keywords
Cite
CITATION STYLE
Bayram, A., Menekse Dalveren, G. G., & Derawi, M. (2025). Comparative Analysis of AI Models for Python Code Generation: A HumanEval Benchmark Study. Applied Sciences (Switzerland), 15(18). https://doi.org/10.3390/app15189907
Register to see more suggestions
Mendeley helps you to discover research relevant for your work.