Many modern convolutional neural networks (CNNs) rely on bottleneck block structures in which the activation tensor is mapped between higher dimensions through an intermediate low dimension, and convolved with depthwise feature filters rather than multichannel filters. Because most of the computation lies in producing these large-dimensional tensors, however, such networks cannot be scaled up without significant computation costs.

In this paper, we show how fusing the layers inside these blocks can dramatically reduce the multiplication count (by 6–20×) at the cost of extra additions. ReLU nonlinearities are predicted dynamically, and only the activations that survive ReLU directly contribute to computing the output of the block. We also propose FusioNet, a CNN architecture optimized for fusion, as well as Archon, a novel accelerator design with a dataflow optimized for fused networks. When FusioNet is executed on the proposed accelerator, it yields up to 5.8× faster inference compared to compact networks executed on a dense DNN accelerator, and 2.1× faster inference compared to the same networks when pruned and executed on a sparse DNN accelerator.
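To make the block structure and its cost concrete, here is a minimal sketch in PyTorch with a rough per-pixel multiply count. The MobileNetV2-style layer ordering and the channel widths (16 → 96 → 16) are illustrative assumptions, not details taken from the paper; the sketch only shows why the pointwise (1×1) layers that touch the high-dimensional tensor dominate the multiplication count that layer fusion targets.

    # Illustrative sketch only (MobileNetV2-style widths assumed), not the paper's code.
    import torch
    import torch.nn as nn

    class BottleneckBlock(nn.Module):
        """Two pointwise (1x1) convolutions around a depthwise 3x3 convolution."""
        def __init__(self, c_low=16, c_high=96):
            super().__init__()
            self.expand = nn.Conv2d(c_low, c_high, kernel_size=1, bias=False)   # low -> high dim
            self.depthwise = nn.Conv2d(c_high, c_high, kernel_size=3, padding=1,
                                       groups=c_high, bias=False)               # per-channel filters
            self.project = nn.Conv2d(c_high, c_low, kernel_size=1, bias=False)  # high -> low dim
            self.relu = nn.ReLU()

        def forward(self, x):
            return self.project(self.relu(self.depthwise(self.relu(self.expand(x)))))

    x = torch.randn(1, 16, 32, 32)
    y = BottleneckBlock()(x)                        # output shape: (1, 16, 32, 32)

    # Multiplies per spatial position (under the assumed widths): the pointwise
    # layers, which read or write the high-dimensional tensor, dominate.
    c_low, c_high, k = 16, 96, 3
    pointwise = c_low * c_high + c_high * c_low     # 1x1 expand + 1x1 project = 3072
    depthwise = c_high * k * k                      # depthwise 3x3            =  864
    print(pointwise, depthwise)

Fusing the layers inside such a block trades many of these pointwise multiplications for additions, and the dynamic ReLU prediction described above determines which intermediate activations survive and therefore need to be computed at all.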
CITATION
Olyaiy, M. H., Ng, C., & Lis, M. (2021). Accelerating DNNs inference with predictive layer fusion. In Proceedings of the International Conference on Supercomputing (pp. 291–303). Association for Computing Machinery. https://doi.org/10.1145/3447818.3460378