Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization

Umit V. Ucak; Islambek Ashyrmamatov; Juyong Lee

Journal Article, Open Access

Journal of Cheminformatics (2023) 15(1)

DOI: 10.1186/s13321-023-00725-9

Abstract

Tokenization is an important preprocessing step in natural language processing that may have a significant influence on prediction quality. This research showed that traditional SMILES tokenization has a limitation: its tokens fail to reflect the true nature of molecules. To address this issue, we developed the atom-in-SMILES tokenization scheme, which eliminates ambiguities arising from the generic nature of SMILES tokens. Our results in multiple chemical translation and molecular property prediction tasks demonstrate that proper tokenization has a significant impact on prediction quality. In terms of prediction accuracy and token degeneration, atom-in-SMILES is a more effective method for generating higher-quality SMILES sequences from AI-based chemical models than other tokenization and representation schemes. We investigated the degree of token degeneration of various schemes and analyzed its adverse effects on prediction quality. Additionally, token-level repetitions were quantified, and generated examples were included for qualitative examination. We believe that atom-in-SMILES tokenization has great potential to be adopted by a broad range of related scientific communities, as it provides chemically accurate, tailor-made tokens for molecular property prediction, chemical translation, and molecular generative models.
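To make the ambiguity described in the abstract concrete, the following is a minimal sketch of conventional atom-wise SMILES tokenization together with a simple token-repetition measure. It is not the authors' atom-in-SMILES implementation. The regular expression is the pattern commonly used for atom-wise SMILES tokenization in chemical language models, and the names `tokenize_smiles` and `adjacent_repetition_rate` are hypothetical helpers introduced only for illustration. The repetition measure is an assumption as well and is not necessarily the metric used in the paper.

```python
import re

# Commonly used atom-wise SMILES pattern (assumed here for illustration;
# this is NOT the atom-in-SMILES scheme described in the abstract).
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into atom-wise tokens (hypothetical helper)."""
    tokens = SMILES_PATTERN.findall(smiles)
    # Guard against characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize SMILES completely: {smiles!r}")
    return tokens


def adjacent_repetition_rate(tokens: list[str]) -> float:
    """Fraction of tokens identical to the token immediately before them.

    A deliberately simple, assumed measure of token-level repetition.
    """
    if len(tokens) < 2:
        return 0.0
    repeats = sum(1 for prev, cur in zip(tokens, tokens[1:]) if prev == cur)
    return repeats / (len(tokens) - 1)


# Aspirin: the methyl, ester carbonyl, and carboxyl carbons all become
# the same generic token "C" under atom-wise tokenization.
aspirin_tokens = tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O")
generic_carbon_count = aspirin_tokens.count("C")
```

In this example the three aliphatic carbons of aspirin sit in chemically different environments, yet each is represented by the identical token `C`. That is the kind of generic token the abstract refers to.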

Cite

Ucak, U. V., Ashyrmamatov, I., & Lee, J. (2023). Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization. Journal of Cheminformatics, 15(1). https://doi.org/10.1186/s13321-023-00725-9
