Abstract
Large Language Models (LLMs) such as ChatGPT and Bard have significantly transformed information retrieval and captured the public's attention with their ability to generate customized responses on a wide range of topics. In this paper, we analyze the ability of different LLMs to generate responses about historical facts in French. Our objective is to evaluate the reliability, comprehensiveness, and relevance of these responses for direct use or extraction. To this end, we propose a benchmark consisting of numerous historical questions covering various types, themes, and difficulty levels. Our evaluation of the responses provided by 14 selected LLMs reveals several limitations in both content and structure. In addition to an overall insufficient precision rate, we observe uneven treatment of the French language, along with verbosity and inconsistency in the generated responses.
Cite
Chartier, M., Dakkoune, N., Bourgeois, G., & Jean, S. (2025). HiBenchLLM: Historical Inquiry Benchmarking for Large Language Models. Data and Knowledge Engineering, 156. https://doi.org/10.1016/j.datak.2024.102383