FPGA acceleration of deep reinforcement learning using on-chip replay management


Abstract

A major bottleneck in parallelizing deep reinforcement learning (DRL) is the high latency of the operations used to update the Prioritized Replay Buffer on the CPU. The low arithmetic intensity of these operations leads to severe under-utilization of the SIMT compute power of GPUs. In this work, we propose a high-throughput on-chip accelerator for the Prioritized Replay Buffer and the learner that efficiently allocates computation and memory resources to saturate the FPGA's compute power. Our design features hardware pipelining on the FPGA such that the latency of replay operations is completely hidden. Our experimental results show that the key operations in managing the Prioritized Replay Buffer, including sampling and priority insertion, are improved by a factor of 21X-40X compared with state-of-the-art CPU and GPU implementations. In addition, our system design yields up to 4.3X improvement in overall throughput compared with state-of-the-art CPU-GPU implementations.
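For context on the operations the abstract refers to: a Prioritized Replay Buffer is typically backed by a sum tree, where priority insertion and proportional sampling each walk one root-to-leaf path. The sketch below is a hypothetical, minimal software illustration of those two O(log N) operations, not the authors' FPGA design; all names (`SumTree`, `insert`, `sample`) are invented for illustration.

```python
import random

class SumTree:
    """Minimal sum tree backing a prioritized replay buffer.

    Internal node i stores the sum of priorities in its subtree;
    leaves occupy indices [capacity, 2*capacity).
    """

    def __init__(self, capacity):
        self.capacity = capacity              # number of leaf (transition) slots
        self.tree = [0.0] * (2 * capacity)    # index 0 unused; root at index 1
        self.next_slot = 0                    # next leaf to overwrite (ring buffer)

    def insert(self, priority):
        """Write a priority into the next leaf and update ancestor sums, O(log N)."""
        leaf = self.capacity + self.next_slot
        self.next_slot = (self.next_slot + 1) % self.capacity
        delta = priority - self.tree[leaf]
        while leaf >= 1:                      # propagate the change up to the root
            self.tree[leaf] += delta
            leaf //= 2

    def sample(self):
        """Draw a slot index with probability proportional to its priority, O(log N)."""
        target = random.uniform(0.0, self.tree[1])
        node = 1
        while node < self.capacity:           # descend until we reach a leaf
            left = 2 * node
            if target <= self.tree[left]:
                node = left
            else:
                target -= self.tree[left]
                node = left + 1
        return node - self.capacity
```

Because both operations are short pointer-chasing walks with almost no arithmetic, they map poorly onto GPU SIMT execution, which is the under-utilization the paper targets with on-chip pipelined replay management.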

APA

Meng, Y., Zhang, C., & Prasanna, V. (2022). FPGA acceleration of deep reinforcement learning using on-chip replay management. In ACM International Conference Proceeding Series (pp. 40–48). Association for Computing Machinery. https://doi.org/10.1145/3528416.3530227
