Fully On-Chip MAC at 14 nm Enabled by Accurate Row-Wise Programming of PCM-Based Weights and Parallel Vector-Transport in Duration-Format

P. Narayanan; S. Ambrogio; A. Okazaki; K. Hosokawa; H. Tsai; A. Nomura; T. Yasuda; C. Mackin; S. C. Lewis; A. Friz; M. Ishii; Y. Kohda; H. Mori; K. Spoon; R. Khaddam-Aljameh; N. Saulnier; M. Bergendahl; J. Demarest; K. W. Brew; V. Chan; S. Choi; I. Ok; I. Ahsan; F. L. Lie; W. Haensch; V. Narayanan; G. W. Burr

Journal Article (Open Access)

IEEE Transactions on Electron Devices (2021) 68(12) 6629-6636

DOI: 10.1109/TED.2021.3115993

53 Citations

51 Readers

Abstract

Hardware acceleration of deep learning using analog non-volatile memory (NVM) requires large arrays with high device yield, high-accuracy Multiply-ACcumulate (MAC) operations, and routing frameworks for implementing arbitrary deep neural network (DNN) topologies. In this article, we present a 14-nm test chip for Analog AI inference. It contains multiple arrays of phase change memory (PCM) devices, each array capable of storing 512 × 512 unique DNN weights and executing massively parallel MAC operations at the location of the data. DNN excitations are transported across the chip using a duration representation on a parallel and reconfigurable 2-D mesh. To accurately transfer inference models to the chip, we describe a closed-loop tuning (CLT) algorithm that programs the four PCM conductances in each weight, achieving <3% average weight-error. A row-wise programming scheme and associated circuitry allow us to execute CLT on up to 512 weights concurrently. We show that the test chip can achieve near-software-equivalent accuracy on two different DNNs. We demonstrate tile-to-tile transport with a fully-on-chip two-layer network for MNIST (accuracy degradation 0.6%) and show resilience to error propagation across long sequences (up to 10 000 characters) with a recurrent long short-term memory (LSTM) network, implementing off-chip activation and vector-vector operations to generate recurrent inputs used in the next on-chip MAC.
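The abstract names three mechanisms without giving their details: a weight built from four conductances, inputs carried as durations, and closed-loop tuning. The Python sketch below illustrates how they might fit together. It is not the authors' implementation. The weight encoding `F * (G+ - G-) + (g+ - g-)`, the significance factor `F`, the coarse/fine threshold, the ±20% write error, and the 0.03 tolerance in normalized units are all assumptions chosen for illustration.

```python
import random

F = 3.0  # hypothetical significance factor; not stated in the abstract


def read_weight(g, f=F):
    """Combine four conductances into one signed weight (assumed encoding)."""
    return f * (g["G+"] - g["G-"]) + (g["g+"] - g["g-"])


def duration_mac(durations, weights):
    """Idealized MAC: each input is a pulse duration, each weight scales the
    charge it contributes, and the contributions are summed."""
    return sum(t * w for t, w in zip(durations, weights))


def closed_loop_tune(target, rng, tol=0.03, max_iters=50, f=F):
    """Read the weight, compare it with the target, and nudge one conductance
    upward; repeat until the error is within tol or max_iters is reached."""
    g = {"G+": 0.0, "G-": 0.0, "g+": 0.0, "g-": 0.0}
    for _ in range(max_iters):
        error = target - read_weight(g, f)
        if abs(error) <= tol:
            break
        # Assumed policy: large errors use the higher-significance pair.
        coarse = abs(error) > 0.5
        name = ("G" if coarse else "g") + ("+" if error > 0 else "-")
        scale = f if coarse else 1.0
        # Assumed imperfect write: lands within +/-20% of the intended step.
        g[name] += abs(error) / scale * rng.uniform(0.8, 1.2)
    return g


rng = random.Random(0)
targets = [0.8, -0.3, 0.05]
tuned = [read_weight(closed_loop_tune(t, rng)) for t in targets]
mac = duration_mac([1.0, 2.0, 3.0], tuned)
```

In this toy model each write leaves at most 20% of the previous error, so the loop settles in a few iterations. Real devices add drift, read noise, and bounded conductance ranges, none of which are modeled here.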

Cite (APA)

Narayanan, P., Ambrogio, S., Okazaki, A., Hosokawa, K., Tsai, H., Nomura, A., … Burr, G. W. (2021). Fully On-Chip MAC at 14 nm Enabled by Accurate Row-Wise Programming of PCM-Based Weights and Parallel Vector-Transport in Duration-Format. IEEE Transactions on Electron Devices, 68(12), 6629–6636. https://doi.org/10.1109/TED.2021.3115993
