Deep learning has become a vital technology in everyday life. Both training datasets and neural networks are growing in size to tackle increasingly challenging problems. Distributed deep neural network (DDNN) training is therefore necessary to train models on such large datasets and networks, and HPC clusters are excellent computation environments for large-scale DDNN training. On these clusters, I/O performance is critical because it is becoming a bottleneck for large-scale DDNN workloads. Most flagship-class HPC clusters provide hierarchical storage systems, and quantifying how much such systems improve workload performance is necessary for designing future HPC storage systems. This study presents a quantitative performance analysis of the hierarchical storage system of a flagship-class supercomputer under a DDNN workload. Our analysis shows how much performance improvement, and how much additional storage volume, would be required to achieve a given performance goal.
CITATION
Fukai, T., Sato, K., & Hirofuchi, T. (2023). Analyzing I/O Performance of a Hierarchical HPC Storage System for Distributed Deep Learning. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 13798 LNCS, pp. 81–93). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-29927-8_7