Abstract
The application of open science and machine learning to scientific, engineering, and industry-relevant problems is a critical component of the cross-department U.S. Artificial Intelligence (AI) strategy, as highlighted, for example, by the AI Initiative, the recent National AI Strategy report ("Strengthening and Democratizing the U.S. Artificial Intelligence Innovation Ecosystem: An Implementation Plan for a National Artificial Intelligence Research Resource," 2023), the Year of Open Data, the Materials Genome Initiative (Pablo et al., 2019; Ward & Warren, 2015), and more. A key aspect of these strategies is to ensure that infrastructure exists to make datasets easily accessible for training, retraining, reproducing, and verifying model performance on chosen tasks. However, the discovery of high-quality, curated datasets adhering to the FAIR principles (findable, accessible, interoperable, and reusable) remains a challenge. To overcome these dataset access challenges, we introduce Foundry-ML, software that combines several services to give researchers the ability to publish and discover structured datasets for ML in science, specifically in materials science and chemistry. Foundry-ML consists of a Python client, a web app, and standardized metadata and file structures built using services including the Materials Data Facility (Blaiszik et al., 2016, 2019) and Globus (Ananthakrishnan et al., 2018; Chard et al., 2015). Together, these services work in conjunction with Python software tooling to dramatically simplify data access patterns, as we show below.
Statement of need
The processes by which high-quality structured science datasets are published and accessed remain decentralized and scattered, without shared standards, with some exceptions (e.g., Wu et al. (2018)).
With Foundry-ML, we provide 1) a simple Python interface that allows users to access structured ML-ready materials science and chemistry datasets with just a few lines of code, 2) a prototype web-based interface for dataset search and discovery, and 3) software that enables users to publish their own ML-ready datasets in a self-service manner.
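As a rough illustration of why standardized metadata can reduce data access to a few lines, the sketch below splits a small table into inputs and targets per split using only a metadata record. It does not use the Foundry-ML client or its actual schema: the metadata layout, the key names, the `load_splits` helper, and the toy rows are all hypothetical and chosen only for illustration.

```python
import csv
import io

# Hypothetical metadata record; the field names are illustrative only and
# are NOT taken from the Foundry-ML schema.
METADATA = {
    "short_name": "toy_dataset",
    "keys": [
        {"name": "composition", "role": "input"},
        {"name": "property_value", "role": "target"},
    ],
    "splits": ["train", "test"],
}

# Toy rows with made-up values, standing in for a published data file.
RAW_CSV = "\n".join([
    "split,composition,property_value",
    "train,toy-A,1.0",
    "train,toy-B,2.5",
    "test,toy-C,0.5",
])


def load_splits(metadata, csv_text):
    """Return {split: (inputs, targets)} using only the metadata record."""
    input_names = [k["name"] for k in metadata["keys"] if k["role"] == "input"]
    target_names = [k["name"] for k in metadata["keys"] if k["role"] == "target"]
    splits = {name: ([], []) for name in metadata["splits"]}
    for row in csv.DictReader(io.StringIO(csv_text)):
        inputs, targets = splits[row["split"]]
        inputs.append({n: row[n] for n in input_names})
        targets.append({n: float(row[n]) for n in target_names})
    return splits


# With the metadata in place, access takes two lines.
data = load_splits(METADATA, RAW_CSV)
X_train, y_train = data["train"]
print(len(X_train), len(y_train))
```

Because the record names which columns are inputs and which are targets, the calling code never hard-codes column names; that is the kind of simplification a shared metadata standard is meant to enable.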
Schmidt, K., Scourtas, A., Ward, L., Wangen, S., Schwarting, M., Darling, I., … Blaiszik, B. (2024). Foundry-ML - Software and Services to Simplify Access to Machine Learning Datasets in Materials Science. Journal of Open Source Software, 9(93), 5467. https://doi.org/10.21105/joss.05467