HiBenchLLM: Historical Inquiry Benchmarking for Large Language Models

Mathieu Chartier; Nabil Dakkoune; Guillaume Bourgeois; Stéphane Jean

Journal Article

Data and Knowledge Engineering (2025) 156

DOI: 10.1016/j.datak.2024.102383

5 Citations

13 Readers

Abstract

Large Language Models (LLMs) such as ChatGPT or Bard have significantly transformed information retrieval and captured the public's attention with their ability to generate customized responses across various topics. In this paper, we analyze the capabilities of different LLMs to generate responses related to historical facts in French. Our objective is to evaluate their reliability, comprehensiveness, and relevance for direct usability or extraction. To accomplish this, we propose a benchmark consisting of numerous historical questions covering various types, themes, and difficulty levels. Our evaluation of responses provided by 14 selected LLMs reveals several limitations in both content and structure. In addition to an overall insufficient precision rate, we observe uneven treatment of the French language, along with issues related to verbosity and inconsistency in the responses generated by LLMs.

Citation (APA)

Chartier, M., Dakkoune, N., Bourgeois, G., & Jean, S. (2025). HiBenchLLM: Historical Inquiry Benchmarking for Large Language Models. Data and Knowledge Engineering, 156. https://doi.org/10.1016/j.datak.2024.102383
