Abstract
Block floating point (BFP), an efficient numerical system for deep neural networks (DNNs), achieves a good trade-off between dynamic range and hardware cost. Specifically, prior works have demonstrated that BFP formats with 3–5-bit mantissas can achieve FP32-comparable accuracy for various DNN workloads. We find that the floating-point accumulator (FP-Acc), which contains modules for normalization, alignment, addition, and fixed-point-to-floating-point (FXP2FP) conversion, dominates the power and area overheads, hindering the hardware efficiency of state-of-the-art low-bit BFP processing engines (BFP-PEs). To mitigate this issue, we propose Bucket Getter, a novel architecture built on the following techniques for improving energy and area efficiency: 1) We propose a bucket-based accumulation unit placed before the FP-Acc, which uses multiple small accumulators (buckets), each responsible for a small range of exponent values; intermediate results are routed to the matching bucket and accumulated in the FXP domain, reducing the activity of the power-hungry alignment and format-conversion units. 2) We propose inter-bucket carry propagation, which allows each bucket to transmit overflow to an adjacent bucket and further reduces the activity of the FP-Acc. 3) We propose an out-of-bound-aware, adaptive, and circular bucket accumulator that significantly reduces the overhead of the bucket-based accumulator. 4) We further propose a shared FP-Acc, which exploits the low FP-Acc activity of the bucket-based architecture by sharing one FP-Acc across several MAC engines, reducing the FP-Acc area overhead. Experimental results based on TSMC 40 nm demonstrate that our proposed Bucket Getter architecture reduces computational energy by up to 57% and improves area efficiency by up to 1.4×, compared to state-of-the-art BFP engines across seven representative DNN models.
Furthermore, our proposed approach helps state-of-the-art floating-point engines reduce PE area by up to 32% and PE power by up to 81%.
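The bucket-based accumulation idea can be illustrated with a small software sketch: partial products are routed by exponent to narrow fixed-point accumulators, overflow is carried to the adjacent bucket instead of invoking the floating-point adder, and a single FXP2FP conversion plus FP reduction runs at the end. The bucket geometry below (`BUCKET_SPAN`, `ACC_BITS`) is an illustrative assumption, not the paper's actual design parameters.

```python
# Hypothetical sketch of bucket-based accumulation with inter-bucket carry
# propagation, assuming products arrive as (mantissa, exponent) pairs.
BUCKET_SPAN = 4      # exponent values covered per bucket (assumed)
ACC_BITS = 12        # fixed-point accumulator width per bucket (assumed)

def bucket_accumulate(products):
    """products: list of (mantissa: int, exponent: int) partial products."""
    buckets = {}                         # bucket index -> fixed-point sum
    for mant, exp in products:
        idx = exp // BUCKET_SPAN         # bucket covering this exponent range
        shift = exp - idx * BUCKET_SPAN  # small in-bucket alignment only
        buckets[idx] = buckets.get(idx, 0) + (mant << shift)

    # Inter-bucket carry propagation: on overflow, pass the high part to the
    # adjacent (higher-exponent) bucket instead of invoking the FP adder.
    for idx in sorted(buckets):
        v = buckets[idx]
        if abs(v) >= (1 << ACC_BITS):
            carry = v >> BUCKET_SPAN
            buckets[idx] = v - (carry << BUCKET_SPAN)
            buckets[idx + 1] = buckets.get(idx + 1, 0) + carry

    # Single final FXP2FP conversion and FP accumulation across buckets.
    return sum(v * 2.0 ** (idx * BUCKET_SPAN) for idx, v in buckets.items())
```

In hardware, the payoff is that the per-product work is a short shift-and-add into a narrow register, while the wide alignment, normalization, and FXP2FP stages of the FP-Acc fire only once per reduction.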
Lo, Y. C., & Liu, R. S. (2023). Bucket Getter: A Bucket-based Processing Engine for Low-bit Block Floating Point (BFP) DNNs. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2023 (pp. 1002–1015). Association for Computing Machinery, Inc. https://doi.org/10.1145/3613424.3614249