Attention-based neural networks have attracted great interest due to their excellent accuracy. However, the attention mechanism involves substantial computation, much of it spent on unnecessary calculations, which significantly limits system performance. To reduce these unnecessary calculations, researchers have proposed sparse attention, which converts some dense-dense matrix multiplication (DDMM) operations into sampled dense-dense matrix multiplication (SDDMM) and sparse matrix multiplication (SpMM) operations. However, current sparse attention solutions introduce massive off-chip random memory accesses because the sparse attention matrix is generally unstructured. We propose CPSAA, a novel crossbar-based processing-in-memory (PIM)-featured sparse attention accelerator that eliminates off-chip data transmissions. 1) We present a novel attention calculation mode to balance crossbar writing and crossbar processing latency. 2) We design a novel PIM-based sparsity pruning architecture to eliminate the pruning phase's off-chip data transfers. 3) Finally, we present novel crossbar-based SDDMM and SpMM methods that process unstructured sparse attention matrices by coupling two types of crossbar arrays. Experimental results show that CPSAA achieves average performance improvements of 89.6×, 32.2×, 17.8×, 3.39×, and 3.84×, and energy savings of 755.6×, 55.3×, 21.3×, 5.7×, and 4.9×, compared with a GPU, a field-programmable gate array (FPGA), SANGER, ReBERT, and ReTransformer, respectively.
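To make the DDMM-to-SDDMM/SpMM conversion concrete, the sketch below shows the generic sparse-attention dataflow in NumPy: an SDDMM samples Q·K^T only at positions kept by a pruning mask, and an SpMM multiplies the resulting sparse attention matrix by V. This is a minimal software illustration of the general technique, not CPSAA's crossbar hardware or pruning method; the function name `sparse_attention`, the boolean mask `M`, and the assumption that every row retains at least one position (e.g., the diagonal) are illustrative choices, not details from the paper.

```python
import numpy as np

def sparse_attention(Q, K, V, M):
    """Q, K, V: (n, d) arrays; M: (n, n) boolean mask of retained positions.

    Assumes every row of M keeps at least one position, otherwise the
    row-wise softmax below is undefined. A hardware kernel would keep the
    attention matrix in a sparse format; the dense softmax here is only
    for clarity.
    """
    n, d = Q.shape
    # SDDMM: compute dot products of Q @ K.T only where M is True.
    rows, cols = np.nonzero(M)
    scores = np.einsum("id,id->i", Q[rows], K[cols]) / np.sqrt(d)
    # Row-wise softmax over the sampled (unstructured) sparse scores;
    # masked-out positions stay at -inf and contribute exp(-inf) = 0.
    S = np.full((n, n), -np.inf)
    S[rows, cols] = scores
    S = np.exp(S - S.max(axis=1, keepdims=True))
    S /= S.sum(axis=1, keepdims=True)
    # SpMM: multiply the sparse attention matrix by the dense V.
    return S @ V

# Small usage example with a random unstructured mask.
n, d = 8, 16
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
M = rng.random((n, n)) < 0.2
M |= np.eye(n, dtype=bool)  # keep the diagonal so no row is empty
out = sparse_attention(Q, K, V, M)  # shape (8, 16)
```

Because the mask is unstructured, the `rows`/`cols` index pairs land at irregular memory locations; this is the random-access pattern that the abstract identifies as the bottleneck CPSAA moves on-chip.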
Li, H., Jin, H., Zheng, L., Liao, X., Huang, Y., Liu, C., … Gui, C. (2024). CPSAA: Accelerating Sparse Attention Using Crossbar-Based Processing-In-Memory Architecture. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 43(6), 1741–1754. https://doi.org/10.1109/TCAD.2023.3344524