From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery

Yuhan Chen; Nuwa Xi; Yanrui Du; Haochun Wang; Jianyu Chen; Sendong Zhao; Bing Qin

Conference Proceedings (Open Access)

Proceedings of the AAAI Conference on Artificial Intelligence (2024) 38(20) 21958-21966

DOI: 10.1609/aaai.v38i20.30198

Abstract

Molecule discovery is a cornerstone of numerous scientific domains, fueling the development of new materials and innovative drug designs. Recent developments in in-silico molecule discovery have highlighted the promise of cross-modal techniques, which bridge molecular structures with their descriptive annotations. However, these cross-modal methods frequently suffer from data scarcity, which hampers their performance and application. In this paper, we address the low-resource challenge by utilizing artificially real data generated by Large Language Models (LLMs). We first introduce a retrieval-based prompting strategy to construct high-quality pseudo data, then explore the optimal method for leveraging this pseudo data. Experiments show that using pseudo data for domain adaptation outperforms all existing methods while requiring a smaller model scale, reduced data size, and lower training cost, highlighting its efficiency. Furthermore, our method shows sustained improvement as the volume of pseudo data increases, revealing the great potential of pseudo data in advancing low-resource cross-modal molecule discovery.

Citation (APA)

Chen, Y., Xi, N., Du, Y., Wang, H., Chen, J., Zhao, S., & Qin, B. (2024). From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 38, pp. 21958–21966). Association for the Advancement of Artificial Intelligence. https://doi.org/10.1609/aaai.v38i20.30198
