Recent work has proposed the high-level synthesis of parallel software programs (specified using Pthreads or OpenMP) into concurrently operating parallel hardware modules. In this paper, we describe resource and memory management techniques for improving the performance and area of hardware generated by such software thread synthesis. One direction investigated pertains to how modules in the HLS-generated parallel hardware should connect to one another: 1) with a nested topology, or 2) with a flat topology. In the nested topology, hardware modules are created in a hierarchical manner: modules are instantiated within the modules that use them. Conversely, the flat topology instantiates all hardware modules at the same level of hierarchy. For the flat topology, we describe a system generator that automatically generates the required interconnect between all hardware modules, and that flexibly shares or replicates functions, functional units, and memories. We also explore methods to reduce memory contention among hardware units that operate in parallel, by investigating three different memory architectures which use: 1) a global memory controller, 2) local memories, and 3) shared-local memories. Local and shared-local memories are dedicated RAM blocks for a single hardware module or a set of hardware modules, and help to increase memory bandwidth by allowing concurrent memory accesses. We also consider memory replication to localize memories within hardware modules, and convert small memories to registers to further improve performance and memory usage. Finally, we describe implementing locks and barriers, the synchronization constructs used in parallel programming, in HLS hardware. We show that with our resource and memory management techniques, we can improve the geomean performance, area, and area-delay product of parallel HLS-generated hardware by up to 41.6%, 38.3%, and 63.3%, respectively, for a set of 15 benchmarks.