LCM: LLM-focused Hybrid SPM-cache Architecture with Cache Management for Multi-Core AI Accelerators

Chengtao Lai; Zhongchun Zhou; Akash Poptani; Wei Zhang

Conference Proceedings, Open Access

Proceedings of the International Conference on Supercomputing (2024) 62-73

DOI: 10.1145/3650200.3656592

Abstract

The proliferation of large language models (LLMs) with substantial computational requirements and memory footprints has necessitated the design of more capable AI accelerators. Given the long compilation time of scratchpad memory-based (SPM-based) AI accelerators and the challenges brought by LLMs, we explore the other side of the tradeoff: a multi-core AI accelerator system that incorporates a shared cache and application-specific management strategies, providing significantly shorter compilation time at the cost of sometimes slightly lower performance than SPM-based systems. In addition, state-of-the-art mixed-precision quantization methods introduce dynamic and irregular memory access patterns that do not fit SPMs well. Experiments on cycle-accurate simulators show up to a 30% reduction in memory time with our cache management strategies and up to a 1.37x overall execution speedup with our prefetcher. Furthermore, we integrate the cache model into a state-of-the-art analytical model for AI accelerators, allowing rapid assessment of the impact of caches on large-scale systems and large ML models. Overall, a system with our managed cache performs comparably to one using a conventional global SPM in non-quantized test cases, and the managed cache can outperform an SPM by 50.5% when a mixed-precision quantization algorithm is used. With our cache management schemes, the proposed system can achieve a 2.57x speedup compared to LRU caches. Finally, we implement the design in RTL; it occupies 0.218 mm² in a 15 nm process and can run at a 1 GHz clock frequency. Our findings explore the potential of a shared cache design to assist the development of future AI accelerator systems.
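The abstract compares managed caches against an LRU baseline but does not describe the management schemes themselves. As general background only, the sketch below shows one well-known reason a plain LRU cache can fare badly on accelerator-style traffic: a cyclic sweep over a working set slightly larger than the cache makes LRU miss on every access, while a policy that keeps a fixed subset resident retains most of the reuse. The `pinned_hits` policy, the trace, and the cache size are illustrative assumptions and are not the paper's method.

```python
from collections import OrderedDict


def lru_hits(trace, capacity):
    """Count hits for a fully associative LRU cache holding `capacity` lines."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used line
            cache[addr] = True
    return hits


def pinned_hits(trace, capacity):
    """Hypothetical managed policy (an assumption for illustration):
    keep the first `capacity` distinct lines resident and bypass the rest."""
    resident = set()
    hits = 0
    for addr in trace:
        if addr in resident:
            hits += 1
        elif len(resident) < capacity:
            resident.add(addr)  # fill phase: allocate until the cache is full
        # otherwise the access bypasses the cache and nothing is evicted
    return hits


# Cyclic sweep over 5 lines, repeated 4 times, against a 4-line cache.
trace = list(range(5)) * 4
print("LRU hits:   ", lru_hits(trace, 4), "of", len(trace))
print("Pinned hits:", pinned_hits(trace, 4), "of", len(trace))
```

On this toy trace LRU evicts each line just before it is reused, so the gap between the two policies is as large as it can be; real workloads sit somewhere in between.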

Citation (APA)

Lai, C., Zhou, Z., Poptani, A., & Zhang, W. (2024). LCM: LLM-focused Hybrid SPM-cache Architecture with Cache Management for Multi-Core AI Accelerators. In Proceedings of the International Conference on Supercomputing (pp. 62–73). Association for Computing Machinery. https://doi.org/10.1145/3650200.3656592
