Resource and memory management techniques for the high-level synthesis of software threads into parallel FPGA hardware

  • Choi J
  • Brown S
  • Anderson J
  • 9

    Readers

    Mendeley users who have this article in their library.
  • 3

    Citations

    Citations of this article.

Abstract

Recent work has proposed the high-level synthesis of parallel software programs (specified using Pthreads or OpenMP) into concurrently operating parallel hardware modules [6]. In this paper, we describe resource and memory management techniques for improving performance and area of hardware generated by such software thread synthesis. One direction investigated pertains to how modules in the HLS-generated parallel hardware should connect to one another: 1) with a nested topology, or 2) with a flat topology. In the nested topology, hardware modules are created in a hierarchical manner: modules are instantiated inside within modules that use them. Conversely, the flat topology instantiates all hardware modules at the same level of hierarchy. For the flat topology, we describe a system generator that automatically generates the required interconnect between all hardware modules, as well as flexibly shares or replicates functions, functional units, and memories. We also explore methods to reduce memory contention among hardware units that operate in parallel, by investigating three different memory architectures which use: 1) a global memory controller, 2) local memories, and 3) shared-local memories. Local and shared-local memories are dedicated RAM blocks for a single or a set of hardware modules, and help to increase memory bandwidth by allowing concurrent memory accesses. We also consider memory replication to localize memories in hardware modules, and convert small memories to registers to further improve performance and memory usage. Finally, we describe implementing locks and barriers in HLS hardware: synchronization constructs used in parallel programming. We show that with our resource and memory management techniques, we can improve the geomean performance, area, and area-delay product of parallel HLS-generated hardware up to 41.6%, 38.3%, and 63.3%, respectively, for a set of 15 benchmarks.

Get free article suggestions today

Mendeley saves you time finding and organizing research

Sign up here
Already have an account ?Sign in

Find this document

Authors

Cite this document

Choose a citation style from the tabs below

Save time finding and organizing research with Mendeley

Sign up for free