Accelerating DNNs inference with predictive layer fusion


Abstract

Many modern convolutional neural networks (CNNs) rely on bottleneck block structures, where the activation tensor is mapped between higher dimensions via an intermediate low dimension and convolved with depthwise feature filters rather than multichannel filters. Because most of the computation lies in producing the high-dimensional tensors, however, such networks cannot be scaled without significant computation costs. In this paper, we show how fusing the layers inside these blocks can dramatically reduce the multiplication count (by 6-20×) at the cost of extra additions. ReLU nonlinearities are predicted dynamically, and only the activations that survive ReLU contribute directly to computing the output of the block. We also propose FusioNet, a CNN architecture optimized for fusion, as well as Archon, a novel accelerator design with a dataflow optimized for fused networks. When FusioNet is executed on the proposed accelerator, it yields up to 5.8× faster inference than compact networks executed on a dense DNN accelerator, and 2.1× faster inference than the same networks pruned and executed on a sparse DNN accelerator.
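As a rough illustration of the linear algebra behind the abstract (not the paper's actual algorithm, predictor, or accelerator dataflow), the sketch below treats the 1×1 convolutions of a bottleneck block as plain matrix products: consecutive linear layers fuse offline into a single matrix, and when a ReLU sits between them, a predicted survival mask lets only the surviving intermediate activations participate in the output. The dimensions, the oracle mask, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_mid, d_out = 64, 384, 64          # expand 64 -> 384, project back (hypothetical sizes)
W1 = rng.standard_normal((d_mid, d_in))   # expansion layer (1x1 conv as a matrix)
W2 = rng.standard_normal((d_out, d_mid))  # projection layer (1x1 conv as a matrix)
x = rng.standard_normal(d_in)             # channel vector at one spatial position

# Baseline: expand, apply ReLU, project.
h = np.maximum(W1 @ x, 0.0)               # d_mid * d_in multiplications
y_ref = W2 @ h                            # d_out * d_mid multiplications

# Without the nonlinearity, the two layers fuse offline into one
# d_out x d_in matrix, so inference needs d_out * d_in multiplies
# instead of d_mid * (d_in + d_out).
W_fused = W2 @ W1
assert np.allclose(W_fused @ x, W2 @ (W1 @ x))

# With ReLU, only the surviving intermediate activations matter.
# Here an oracle computes the mask exactly; in the paper's setting
# a cheap dynamic predictor would supply it instead, so the full
# intermediate tensor is never materialized.
keep = (W1 @ x) > 0
y_masked = W2[:, keep] @ (W1[keep] @ x)   # multiplies scale with surviving rows only
assert np.allclose(y_ref, y_masked)
```

The identity `W2 @ relu(W1 @ x) == W2[:, keep] @ (W1[keep] @ x)` is exact whenever `keep` marks the positive intermediate activations, which is why mispredicting the mask, rather than the arithmetic itself, is the source of approximation in any predictive scheme of this kind.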

Cite

APA

Olyaiy, M. H., Ng, C., & Lis, M. (2021). Accelerating DNNs inference with predictive layer fusion. In Proceedings of the International Conference on Supercomputing (pp. 291–303). Association for Computing Machinery. https://doi.org/10.1145/3447818.3460378
