Abstract
Large Language Models (LLMs) offer promising capabilities for converting unstructured software documentation into structured task flows, yet their outputs often lack the procedural reliability critical for software engineering. This paper presents a comprehensive framework that benchmarks five leading LLMs (Gemini 2.5 Pro, Grok 3, GPT-Omni, DeepSeek-R1, and LLaMA-3) across five prompting strategies, including Zero-Shot, Chain-of-Thought, and ISO 21502-Guided, using real-world software tutorials from the “Build Your Own X” repository. We introduce the Hybrid Semantic Similarity Metric (HSSM), which combines SentenceTransformer embeddings with context-aware key-term overlap, capturing both semantic fidelity and procedural coherence. Compared to traditional metrics such as BERTScore, SBERT, and USE, HSSM exhibits significantly lower variance (coefficient of variation: 1.5–2.9%) and stronger correlation with human judgments. Our results show that even minimal prompting (Zero-Shot) can yield highly aligned task flows (HSSM: 96.33%) when evaluated with robust metrics. This work offers a scalable evaluation paradigm for LLM-assisted software planning, with implications for AI-driven project management, prompt engineering, and procedural generation in software education and tooling.
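
The abstract describes HSSM only at a high level. The snippet below is a minimal sketch of how a hybrid score blending SentenceTransformer embedding similarity with key-term overlap might be computed; the model name, the 0.7/0.3 weighting, and the simple keyword heuristic are illustrative assumptions, not the authors' published implementation.

# Minimal sketch of a hybrid semantic similarity score in the spirit of HSSM.
# Assumptions (not from the paper): the all-MiniLM-L6-v2 model, the alpha=0.7
# weighting, and the Jaccard keyword-overlap heuristic are illustrative only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def key_term_overlap(reference: str, candidate: str) -> float:
    # Jaccard overlap of lowercased content words, a stand-in for the
    # paper's context-aware key-term matching.
    ref_terms = {w for w in reference.lower().split() if len(w) > 3}
    cand_terms = {w for w in candidate.lower().split() if len(w) > 3}
    if not ref_terms or not cand_terms:
        return 0.0
    return len(ref_terms & cand_terms) / len(ref_terms | cand_terms)

def hssm_score(reference: str, candidate: str, alpha: float = 0.7) -> float:
    # Blend embedding cosine similarity with key-term overlap.
    emb = model.encode([reference, candidate], convert_to_tensor=True)
    semantic = util.cos_sim(emb[0], emb[1]).item()
    return alpha * semantic + (1 - alpha) * key_term_overlap(reference, candidate)

print(hssm_score("Initialize the git repository and commit the scaffold.",
                 "Create a git repo, then commit the initial scaffolding."))

In practice, the weighting and the key-term extraction would need to be tuned against human judgments, which is the kind of validation the paper's correlation analysis reports.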