A major bottleneck in parallelizing deep reinforcement learning (DRL) is the high latency of the operations used to update the Prioritized Replay Buffer on the CPU. The low arithmetic intensity of these operations leads to severe under-utilization of the SIMT computation power of GPUs. In this work, we propose a high-throughput on-chip accelerator for the Prioritized Replay Buffer and the learner that efficiently allocates computation and memory resources to saturate the FPGA's computation power. Our design features hardware pipelining on the FPGA such that the latency of replay operations is completely hidden. Our experimental results show that the performance of the key operations for managing the Prioritized Replay Buffer, including sampling and priority insertion, improves by a factor of 21X-40X compared with state-of-the-art implementations on CPU and GPU. In addition, our system design yields up to a 4.3X improvement in overall throughput compared with state-of-the-art CPU-GPU implementations.
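The paper's FPGA design itself is not reproduced here, but the replay operations it accelerates (proportional sampling and priority insertion) are conventionally implemented on a binary sum-tree. A minimal CPU-side Python sketch of that standard data structure, written from the general formulation rather than from this paper, illustrates why both operations cost O(log n) per call and why their low arithmetic intensity makes them latency-bound:

```python
import random

class SumTree:
    """Binary sum-tree as commonly used by Prioritized Replay Buffers:
    leaves hold per-transition priorities, internal nodes hold subtree
    sums, so priority updates and proportional sampling both take
    O(log n) pointer-chasing steps (low arithmetic intensity)."""

    def __init__(self, capacity):
        # capacity is assumed to be a power of two for simplicity;
        # node 1 is the root, leaf i lives at index capacity + i.
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity)

    def update(self, idx, priority):
        """Insert/overwrite the priority of leaf `idx`, then propagate
        the new partial sums up to the root."""
        node = self.capacity + idx
        self.tree[node] = priority
        node //= 2
        while node >= 1:
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2

    def sample(self):
        """Draw a leaf index with probability proportional to its
        priority by descending from the root."""
        value = random.uniform(0.0, self.tree[1])  # tree[1] = total priority
        node = 1
        while node < self.capacity:  # descend until a leaf is reached
            left = 2 * node
            if value <= self.tree[left]:
                node = left
            else:
                value -= self.tree[left]
                node = left + 1
        return node - self.capacity
```

Each call touches one root-to-leaf path of dependent memory accesses, which is exactly the serial, branchy workload that under-utilizes GPU SIMT hardware and that the paper's pipelined on-chip design hides.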
Meng, Y., Zhang, C., & Prasanna, V. (2022). FPGA acceleration of deep reinforcement learning using on-chip replay management. In ACM International Conference Proceeding Series (pp. 40–48). Association for Computing Machinery. https://doi.org/10.1145/3528416.3530227