Abstract
Existing research on chatbot evaluation suffers from inconsistent assessment standards, fragmented criteria, and insufficient coverage of critical dimensions like legal compliance and ethical alignment, which hinders reliable benchmarking of chatbots' performance. Our study proposes a comprehensive framework for such evaluation and systematically compares five chatbot systems: Tidio (Rule-Based), GPT-4o (AI-Powered), Claude 3.5 Sonnet (LLM), Watson Assistant (Enterprise), and Qwen2.5-Max (Multilingual) in terms of their accuracy, safety, legal compliance, generalizability of performance, and ethical alignment. We conclude that while chatbots enhance efficiency in healthcare (97.34% patient education completeness) and e-commerce (30%-40% cost reduction), critical limitations persist. Recommendations include: (1) retrieval-augmented generation (RAG) for hallucination reduction, (2) ethical governance frameworks (e.g., AILuminate), and (3) domain-specialized tuning. Cross-sector collaboration and standardized evaluations are essential for responsible deployment of AI.
Cite
Xu, H., Wan, L., Li, Y., Liu, J., & Lau, A. S. M. (2025). Comparative Analysis of Chatbot Systems. In Frontiers in Artificial Intelligence and Applications (Vol. 412, pp. 392–398). IOS Press BV. https://doi.org/10.3233/FAIA250737