A major bottleneck in parallelizing deep reinforcement learning (DRL) is the high latency of the operations used to update the Prioritized Replay Buffer on the CPU. The low arithmetic intensity of these operations leads to severe under-utilization of the SIMT computation power of GPUs. In this work, we propose a high-throughput on-chip accelerator for the Prioritized Replay Buffer and the learner that efficiently allocates computation and memory resources to saturate the FPGA's computation power. Our design features hardware pipelining on the FPGA such that the latency of replay operations is completely hidden. Our experimental results show that the performance of the key operations for managing the Prioritized Replay Buffer, including sampling and priority insertion, improves by a factor of 21X-40X compared with state-of-the-art implementations on CPU and GPU. In addition, our system design yields up to a 4.3X improvement in overall throughput compared with state-of-the-art CPU-GPU implementations.
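The paper's FPGA design itself is not reproduced here, but the replay operations it accelerates (proportional sampling and priority insertion) are conventionally implemented on a binary sum-tree. A minimal CPU-side Python sketch of that standard data structure, written from the general formulation rather than from this paper, illustrates why both operations cost O(log n) per call and why their low arithmetic intensity makes them latency-bound:

```python
import random

class SumTree:
    """Binary sum-tree as commonly used by Prioritized Replay Buffers:
    leaves hold per-transition priorities, internal nodes hold subtree
    sums, so priority updates and proportional sampling both take
    O(log n) pointer-chasing steps (low arithmetic intensity)."""

    def __init__(self, capacity):
        # capacity is assumed to be a power of two for simplicity;
        # node 1 is the root, leaf i lives at index capacity + i.
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity)

    def update(self, idx, priority):
        """Insert/overwrite the priority of leaf `idx`, then propagate
        the new partial sums up to the root."""
        node = self.capacity + idx
        self.tree[node] = priority
        node //= 2
        while node >= 1:
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2

    def sample(self):
        """Draw a leaf index with probability proportional to its
        priority by descending from the root."""
        value = random.uniform(0.0, self.tree[1])  # tree[1] = total priority
        node = 1
        while node < self.capacity:  # descend until a leaf is reached
            left = 2 * node
            if value <= self.tree[left]:
                node = left
            else:
                value -= self.tree[left]
                node = left + 1
        return node - self.capacity
```

Each call touches one root-to-leaf path of dependent memory accesses, which is exactly the serial, branchy workload that under-utilizes GPU SIMT hardware and that the paper's pipelined on-chip design hides.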
Meng, Y., Zhang, C., & Prasanna, V. (2022). FPGA acceleration of deep reinforcement learning using on-chip replay management. In ACM International Conference Proceeding Series (pp. 40–48). Association for Computing Machinery. https://doi.org/10.1145/3528416.3530227