From text to insight: large language models for chemical data extraction

Mara Schilling-Wilhelmi; Martiño Ríos-García; Sherjeel Shabih; María Victoria Gil; Santiago Miret; Christoph T. Koch; José A. Márquez; Kevin Maik Jablonka

Article (Open Access)

Chemical Society Reviews

DOI: 10.1039/d4cs00913d

82 Citations

160 Readers

Abstract

The vast majority of chemical knowledge exists in unstructured natural language, yet structured data is crucial for innovative and systematic materials design. Traditionally, the field has relied on manual curation and partial automation for data extraction for specific use cases. The advent of large language models (LLMs) represents a significant shift, potentially enabling non-experts to extract structured, actionable data from unstructured text efficiently. While applying LLMs to chemical and materials science data extraction presents unique challenges, domain knowledge offers opportunities to guide and validate LLM outputs. This tutorial review provides a comprehensive overview of LLM-based structured data extraction in chemistry, synthesizing current knowledge and outlining future directions. We address the lack of standardized guidelines and present frameworks for leveraging the synergy between LLMs and chemical expertise. This work serves as a foundational resource for researchers aiming to harness LLMs for data-driven chemical research. The insights presented here could significantly enhance how researchers across chemical disciplines access and utilize scientific information, potentially accelerating the development of novel compounds and materials for critical societal needs.

Cite

Schilling-Wilhelmi, M., Ríos-García, M., Shabih, S., Gil, M. V., Miret, S., Koch, C. T., … Jablonka, K. M. (2024, December 20). From text to insight: large language models for chemical data extraction. Chemical Society Reviews. Royal Society of Chemistry. https://doi.org/10.1039/d4cs00913d