HiBenchLLM: Historical Inquiry Benchmarking for Large Language Models

Abstract

Large Language Models (LLMs) such as ChatGPT or Bard have significantly transformed information retrieval and captured the public's attention with their ability to generate customized responses across various topics. In this paper, we analyze the capabilities of different LLMs to generate responses related to historical facts in French. Our objective is to evaluate their reliability, comprehensiveness, and relevance for direct usability or extraction. To accomplish this, we propose a benchmark consisting of numerous historical questions covering various types, themes, and difficulty levels. Our evaluation of responses provided by 14 selected LLMs reveals several limitations in both content and structure. In addition to an overall insufficient precision rate, we observe uneven treatment of the French language, along with issues related to verbosity and inconsistency in the responses generated by LLMs.

Citation (APA)

Chartier, M., Dakkoune, N., Bourgeois, G., & Jean, S. (2025). HiBenchLLM: Historical Inquiry Benchmarking for Large Language Models. Data and Knowledge Engineering, 156. https://doi.org/10.1016/j.datak.2024.102383
