Atomic vector operations on chip multiprocessors

  • Kumar S
  • Kim D
  • Smelyanskiy M
 et al. 


The current trend is for processors to deliver dramatic improvements in parallel performance while only modestly improving serial performance. Parallel performance is harvested through vector/SIMD instructions as well as multithreading (through both multithreaded cores and chip multiprocessors). Vector parallelism can be supported more efficiently than multithreading, but is often harder for software to exploit. In particular, code with sparse data access patterns cannot easily use the vector/SIMD instructions of mainstream processors. Hardware to scatter and gather sparse data has previously been proposed to enable vector execution for such code. However, on multithreaded architectures, a number of applications spend significant time on atomic operations (e.g., parallel reductions), which cannot be vectorized using previously proposed schemes. This paper proposes architectural support for atomic vector operations (referred to as GLSC) that addresses this limitation. GLSC extends scatter-gather hardware to support atomic memory operations. Our experiments show that GLSC improves performance by an average of 54% on a set of important RMS kernels for 4-wide SIMD.
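To make the limitation concrete, the following is a minimal single-threaded Python sketch of why a plain gather/add/scatter sequence cannot implement a reduction such as a histogram when several SIMD lanes target the same address, and how a conditional scatter with retry could repair it. It is an illustration under stated assumptions, not the mechanism described in the paper: the function names, the "first lane wins, the others retry" conflict rule, and the software emulation of lanes are all hypothetical, and cross-thread atomicity is not modeled.

```python
SIMD_WIDTH = 4  # matches the 4-wide SIMD configuration mentioned in the abstract


def naive_vector_increment(hist, idx):
    """Plain gather, add, scatter. Lanes with the same index lose updates."""
    gathered = [hist[i] for i in idx]      # gather: every lane reads the old value
    updated = [v + 1 for v in gathered]    # SIMD add on the gathered values
    for i, v in zip(idx, updated):         # scatter: the last writer wins
        hist[i] = v


def conditional_vector_increment(hist, idx):
    """Hypothetical retry loop: per round, one lane per address succeeds.

    Assumption: conflicting lanes simply fail and are retried with freshly
    gathered values, in the spirit of load-linked/store-conditional.
    """
    pending = [True] * len(idx)
    while any(pending):
        # Gather current values for the lanes that still have work to do.
        gathered = [hist[i] if p else None for i, p in zip(idx, pending)]
        claimed = set()
        for lane, i in enumerate(idx):
            if not pending[lane] or i in claimed:
                continue                   # lane is done, or lost the conflict
            claimed.add(i)
            hist[i] = gathered[lane] + 1   # conditional scatter succeeds
            pending[lane] = False


def histogram(data, bins, update):
    """Process the input in SIMD_WIDTH-sized chunks using the given update."""
    hist = [0] * bins
    for start in range(0, len(data), SIMD_WIDTH):
        update(hist, data[start:start + SIMD_WIDTH])
    return hist


data = [0, 0, 1, 0, 2, 2, 2, 2]
naive_result = histogram(data, 3, naive_vector_increment)
safe_result = histogram(data, 3, conditional_vector_increment)
```

With this input the naive version undercounts every bin that is hit more than once within a vector, while the retry version matches a scalar histogram. The retry loop costs extra rounds only when lanes actually collide, which is the case a scheme of this kind is meant to handle.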



  • Sanjeev Kumar

  • Daehyun Kim

  • Mikhail Smelyanskiy

  • Yen Kuang Chen

  • Jatin Chhugani

  • Christopher J. Hughes
