Using SIMD registers and instructions to enable instruction-level parallelism in sorting algorithms

  • Furtak T
  • Amaral J
  • Niewiadomski R

Most contemporary processors offer some version of Single Instruction Multiple Data (SIMD) machinery: vector registers and instructions to manipulate data stored in such registers. The central idea of this paper is to use these SIMD resources to improve the performance of the tail of recursive algorithms. When a recursive computation reaches a set threshold, data is loaded into the vector registers, manipulated in-register, and the result stored back to memory. Four implementations of sorting with two different kinds of SIMD machinery (x86-64’s SSE2 and G5’s AltiVec) demonstrate that this idea delivers significant performance improvement. The improvements provided by the tail optimization of sorting are orthogonal to the gains obtained through empirical search for a suitable sorting algorithm [10]. When integrated with the Dynamically Tuned Sorting Library (DTSL), this new code generation strategy improves the performance of DTSL by up to 18%. Performance of d-heaps is similarly improved by up to 35%.
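
The threshold-triggered tail described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: it is a portable scalar stand-in, assuming a quicksort driver, a cutoff of 8 elements, and Batcher's 8-input odd-even merge sorting network for the tail. The names TAIL_THRESHOLD, cmp_swap, sort_tail, and sort_with_network_tail are hypothetical. Each compare-exchange is written as a min/max pair, which is the operation a SIMD version would presumably perform on whole vector registers.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical cutoff; the abstract only says "a set threshold". */
#define TAIL_THRESHOLD 8

/* Compare-exchange written as a min/max pair: afterwards *a <= *b.
   A SIMD version would apply vector min/max to whole registers. */
static void cmp_swap(int *a, int *b) {
    int lo = (*a < *b) ? *a : *b;
    int hi = (*a < *b) ? *b : *a;
    *a = lo;
    *b = hi;
}

/* Tail case: load up to 8 values into a local buffer (standing in for
   vector registers), sort them with a fixed sorting network, and store
   the result back. Unused slots are padded with INT_MAX so they sink
   to the end and are never copied back. */
static void sort_tail(int *data, long n) {
    /* Batcher's odd-even merge sort network for 8 inputs (19 comparators). */
    static const int net[19][2] = {
        {0, 1}, {2, 3}, {4, 5}, {6, 7},
        {0, 2}, {1, 3}, {4, 6}, {5, 7},
        {1, 2}, {5, 6},
        {0, 4}, {1, 5}, {2, 6}, {3, 7},
        {2, 4}, {3, 5},
        {1, 2}, {3, 4}, {5, 6}
    };
    int r[8];
    long i;

    assert(n >= 0 && n <= TAIL_THRESHOLD);
    for (i = 0; i < 8; i++)
        r[i] = (i < n) ? data[i] : INT_MAX;            /* load  */
    for (i = 0; i < 19; i++)
        cmp_swap(&r[net[i][0]], &r[net[i][1]]);        /* sort  */
    for (i = 0; i < n; i++)
        data[i] = r[i];                                /* store */
}

/* Quicksort that hands small partitions to the network-based tail. */
void sort_with_network_tail(int *data, long n) {
    long i, j;
    int pivot, tmp;

    if (n <= TAIL_THRESHOLD) {
        sort_tail(data, n);
        return;
    }
    pivot = data[n / 2];
    i = 0;
    j = n - 1;
    while (i <= j) {                       /* Hoare-style partition */
        while (data[i] < pivot) i++;
        while (data[j] > pivot) j--;
        if (i <= j) {
            tmp = data[i]; data[i] = data[j]; data[j] = tmp;
            i++;
            j--;
        }
    }
    sort_with_network_tail(data, j + 1);       /* left part:  data[0..j]   */
    sort_with_network_tail(data + i, n - i);   /* right part: data[i..n-1] */
}
```

A caller would invoke sort_with_network_tail(array, length) on an int array. The fixed comparator sequence has no data-dependent control flow, which is what makes this kind of tail a plausible candidate for in-register SIMD execution; the speedups quoted in the abstract apply to the authors' SSE2 and AltiVec implementations, not to this scalar sketch.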

Author-supplied keywords

  • instruction-level parallelism
  • quicksort
  • simd
  • sorting
  • sorting networks
  • vectorization

Authors

  • Timothy Furtak
  • José Nelson Amaral
  • Robert Niewiadomski
