In-memory computing with large last-level caches is promising to dramatically alleviate data movement bottlenecks and expose massive bitline-level parallelization opportunities. However, key challenges from its unique execution model remain unsolved: automated parallelization, transparently orchestrating data transposition/alignment/broadcast for bit-serial logic, and mixing in-/near-memory computing. Most importantly, the solution should be programmer friendly and portable across platforms. Our key innovation is an execution model and intermediate representation (IR) that enables hybrid CPU-core, in-memory, and near-memory processing. Our IR is the tensor dataflow graph (tDFG), which is a unified representation of in-memory and near-memory computation. The tDFG exposes tensor-data structure information so that the hardware and runtime can automatically orchestrate data management for bitserial execution, including runtime data layout transformations. To enable microarchitecture portability, we use a two-phase, JIT-based compilation approach to dynamically lower the the tDFG to in-memory commands. Our design, infinity stream, is evaluated on a cycle-accurate simulator. Across data-processing workloads with fp32, it achieves 2.6× speedup and 75% traffic reduction over a state-of-the-art near-memory computing technique, with 2.4× energy efficiency.
CITATION STYLE
Wang, Z., Liu, C., Arora, A., John, L., & Nowatzki, T. (2023). Infinity Stream: Portable and Programmer-Friendly In-/Near-Memory Fusion. In International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS (Vol. 3, pp. 359–375). Association for Computing Machinery. https://doi.org/10.1145/3582016.3582032
Mendeley helps you to discover research relevant for your work.