Abstract
Large Language Models (LLMs) offer promising capabilities for converting unstructured software documentation into structured task flows, yet their outputs often lack the procedural reliability critical for software engineering. This paper presents a comprehensive framework that benchmarks five leading LLMs (Gemini 2.5 Pro, Grok 3, GPT-Omni, DeepSeek-R1, and LLaMA-3) across five prompting strategies, including Zero-Shot, Chain-of-Thought, and ISO 21502-Guided, using real-world software tutorials from the “Build Your Own X” repository. We introduce the Hybrid Semantic Similarity Metric (HSSM), which combines SentenceTransformer embeddings with context-aware key-term overlap, capturing both semantic fidelity and procedural coherence. Compared to traditional metrics such as BERTScore, SBERT, and USE, HSSM exhibits significantly lower variance (coefficient of variation: 1.5–2.9%) and stronger correlation with human judgments. Our results show that even minimal prompting (Zero-Shot) can yield highly aligned task flows (HSSM: 96.33%) when evaluated with robust metrics. This work offers a scalable evaluation paradigm for LLM-assisted software planning, with implications for AI-driven project management, prompt engineering, and procedural generation in software education and tooling.
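
The abstract describes HSSM only at a high level. The snippet below is a minimal sketch of how a hybrid score blending SentenceTransformer embedding similarity with key-term overlap might be computed; the model name, the 0.7/0.3 weighting, and the simple keyword heuristic are illustrative assumptions, not the authors' published implementation.

# Minimal sketch of a hybrid semantic similarity score in the spirit of HSSM.
# Assumptions (not from the paper): the all-MiniLM-L6-v2 model, the alpha=0.7
# weighting, and the Jaccard keyword-overlap heuristic are illustrative only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def key_term_overlap(reference: str, candidate: str) -> float:
    # Jaccard overlap of lowercased content words, a stand-in for the
    # paper's context-aware key-term matching.
    ref_terms = {w for w in reference.lower().split() if len(w) > 3}
    cand_terms = {w for w in candidate.lower().split() if len(w) > 3}
    if not ref_terms or not cand_terms:
        return 0.0
    return len(ref_terms & cand_terms) / len(ref_terms | cand_terms)

def hssm_score(reference: str, candidate: str, alpha: float = 0.7) -> float:
    # Blend embedding cosine similarity with key-term overlap.
    emb = model.encode([reference, candidate], convert_to_tensor=True)
    semantic = util.cos_sim(emb[0], emb[1]).item()
    return alpha * semantic + (1 - alpha) * key_term_overlap(reference, candidate)

print(hssm_score("Initialize the git repository and commit the scaffold.",
                 "Create a git repo, then commit the initial scaffolding."))

In practice, the weighting and the key-term extraction would need to be tuned against human judgments, which is the kind of validation the paper's correlation analysis reports.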