MatSciBERT: A materials domain language model for text mining and information extraction

Tanishq Gupta; Mohd Zaki; N. M. Anoop Krishnan; Mausam

Journal Article, Open Access

npj Computational Materials (2022) 8(1)

DOI: 10.1038/s41524-022-00784-w

109 Citations

162 Readers

Abstract

A large amount of materials science knowledge is generated and stored as text published in peer-reviewed scientific literature. While recent developments in natural language processing, such as Bidirectional Encoder Representations from Transformers (BERT) models, provide promising information extraction tools, these models may yield suboptimal results when applied to the materials domain, since they are not trained on materials-science-specific notation and jargon. Here, we present a materials-aware language model, MatSciBERT, trained on a large corpus of peer-reviewed materials science publications. We show that MatSciBERT outperforms SciBERT, a language model trained on a scientific corpus, and establish state-of-the-art results on three downstream tasks: named entity recognition, relation classification, and abstract classification. We make the pre-trained weights of MatSciBERT publicly accessible for accelerated materials discovery and information extraction from materials science texts.

Cite

Gupta, T., Zaki, M., Krishnan, N. M. A., & Mausam. (2022). MatSciBERT: A materials domain language model for text mining and information extraction. npj Computational Materials, 8(1). https://doi.org/10.1038/s41524-022-00784-w
