A shared matrix unit for a chip multi-core processor

Abstract

This paper proposes extending a multi-core processor with a shared matrix unit to maximize on-chip resource utilization and to exploit the multi-core trend to improve the performance of data-parallel applications. Each core fetches scalar, vector, and matrix instructions from its instruction cache. Scalar instructions continue executing on the scalar datapath, whereas vector/matrix instructions are issued by the decode stage to the shared matrix unit through the corresponding FIFO queue. Scalar results of reduction vector/matrix instructions are sent back from the matrix unit to the scalar core that issued them. The performance evaluation uses several dense linear algebra kernels (scalar-vector multiplication, scalar times vector plus another vector, applying a Givens rotation, rank-1 update, vector-matrix multiplication, and matrix-matrix multiplication) as well as the discrete cosine transform, sum of absolute differences, and affine transformation. The results show that sharing the matrix unit between two cores improves its utilization by 9% to 26% compared with extending a single core with a matrix unit. Moreover, the average speedup of the dual-core shared matrix unit over a single core extended with a matrix unit ranges from 6% to 24%, and the maximum speedup ranges from 13% to 46%. © 2013 Elsevier Inc.
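The dispatch path described in the abstract can be illustrated with a small functional model. This is only a sketch under stated assumptions, not the authors' design: the `Instr` labels, the round-robin arbitration between queues, and the two sample operations (`scale`, `sad`) are all hypothetical choices made for illustration, and no timing or pipeline behaviour is modelled.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Instr:
    kind: str          # "scalar", "vector", or "reduce" (hypothetical labels)
    op: str            # operation name, illustrative only
    args: tuple = ()


def run_on_matrix_unit(instr):
    """Execute one vector/matrix instruction; returns a vector or a scalar."""
    if instr.op == "scale":            # scalar-vector multiplication
        a, x = instr.args
        return [a * xi for xi in x]
    if instr.op == "sad":              # sum of absolute differences (a reduction)
        x, y = instr.args
        return sum(abs(xi - yi) for xi, yi in zip(x, y))
    raise ValueError(f"unknown op: {instr.op}")


class SharedMatrixUnit:
    """One FIFO queue per core; round-robin service is an assumption."""

    def __init__(self, num_cores):
        self.queues = [deque() for _ in range(num_cores)]
        self.next_core = 0
        self.issued = 0

    def enqueue(self, core_id, instr):
        self.queues[core_id].append(instr)

    def step(self):
        """Serve one queued instruction; return (core_id, scalar or None)."""
        n = len(self.queues)
        for offset in range(n):
            core_id = (self.next_core + offset) % n
            if self.queues[core_id]:
                instr = self.queues[core_id].popleft()
                self.next_core = (core_id + 1) % n
                self.issued += 1
                result = run_on_matrix_unit(instr)
                # Only reductions send a scalar back to the issuing core.
                return core_id, (result if instr.kind == "reduce" else None)
        return None


def simulate(programs):
    unit = SharedMatrixUnit(len(programs))
    scalar_count = [0] * len(programs)
    scalar_results = [[] for _ in programs]
    # Decode stage: scalar instructions stay on the core, the rest are queued.
    for core_id, program in enumerate(programs):
        for instr in program:
            if instr.kind == "scalar":
                scalar_count[core_id] += 1
            else:
                unit.enqueue(core_id, instr)
    # Drain the shared unit, routing reduction results to the issuing core.
    while any(unit.queues):
        core_id, result = unit.step()
        if result is not None:
            scalar_results[core_id].append(result)
    return scalar_count, scalar_results, unit.issued


programs = [
    [Instr("scalar", "add"),
     Instr("vector", "scale", (2, [1, 2, 3])),
     Instr("reduce", "sad", ([1, 2, 3], [3, 2, 0]))],
    [Instr("reduce", "sad", ([1, 1], [4, 5]))],
]
scalar_count, scalar_results, issued = simulate(programs)
```

In this toy run, core 0 keeps its one scalar instruction local and queues two instructions for the shared unit, while core 1 queues one. Each reduction result is appended to the result list of the core that issued it, mirroring the write-back path the abstract describes.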

Cite

Soliman, M. I., & Al-Junaid, A. F. (2013). A shared matrix unit for a chip multi-core processor. Journal of Parallel and Distributed Computing, 73(8), 1146–1156. https://doi.org/10.1016/j.jpdc.2013.03.004
