Most contemporary processors offer some version of Single Instruction Multiple Data (SIMD) machinery: vector registers and instructions to manipulate data stored in such registers. The central idea of this paper is to use these SIMD resources to improve the performance of the tail of recursive algorithms. When a recursive computation reaches a set threshold, data is loaded into the vector registers, manipulated in-register, and the result is stored back to memory. Four implementations of sorting on two different SIMD instruction sets, x86-64's SSE2 and the G5's AltiVec, demonstrate that this idea delivers significant performance improvement. The improvements provided by the tail optimization of sorting are orthogonal to the gains obtained through empirical search for a suitable sorting algorithm. When integrated with the Dynamically Tuned Sorting Library (DTSL), this new code generation strategy improves the performance of DTSL by up to 18%. Performance of d-heaps is similarly improved by up to 35%.
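The load, manipulate in-register, store pattern described above can be illustrated with a minimal sketch. This is not the authors' generated code: the function names `tail_sort` and `simd_tail_quicksort`, the threshold of 16, the use of quicksort as the outer recursion, and the scalar insertion pass that finishes the tail are all assumptions made for illustration. The sketch assumes an x86 target (it uses the SSE `float` intrinsics from `<xmmintrin.h>`, which every x86-64 processor supports) and input without NaNs.

```c
#include <math.h>       /* INFINITY */
#include <stddef.h>
#include <string.h>
#include <xmmintrin.h>  /* _mm_min_ps, _mm_max_ps, _MM_TRANSPOSE4_PS */

#define TAIL_THRESHOLD 16  /* hypothetical cutoff; a real one would be tuned */

/* Lane-wise compare-exchange: lo receives the minima, hi the maxima. */
#define CMPX(lo, hi) do { __m128 t_ = _mm_min_ps(lo, hi); \
                          hi = _mm_max_ps(lo, hi); lo = t_; } while (0)

/* Sort n <= 16 floats, doing most comparisons inside vector registers. */
static void tail_sort(float *a, size_t n)
{
    float buf[16];
    /* Pad to 16 values so that four full vectors can be loaded. */
    for (size_t i = 0; i < 16; i++) buf[i] = (i < n) ? a[i] : INFINITY;

    __m128 r0 = _mm_loadu_ps(buf),     r1 = _mm_loadu_ps(buf + 4);
    __m128 r2 = _mm_loadu_ps(buf + 8), r3 = _mm_loadu_ps(buf + 12);

    /* 5-comparator sorting network for 4 inputs, applied to all four
       columns (lanes) at once; no branches are involved. */
    CMPX(r0, r1); CMPX(r2, r3);
    CMPX(r0, r2); CMPX(r1, r3);
    CMPX(r1, r2);

    /* Transpose so that each sorted column becomes a contiguous row. */
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);
    _mm_storeu_ps(buf, r0);     _mm_storeu_ps(buf + 4, r1);
    _mm_storeu_ps(buf + 8, r2); _mm_storeu_ps(buf + 12, r3);

    /* buf now holds four sorted runs of four; a scalar insertion pass
       combines them (an assumption of this sketch, chosen for brevity). */
    for (size_t i = 1; i < 16; i++) {
        float v = buf[i];
        size_t j = i;
        while (j > 0 && buf[j - 1] > v) { buf[j] = buf[j - 1]; j--; }
        buf[j] = v;
    }
    memcpy(a, buf, n * sizeof *a);  /* padding sorts last and is dropped */
}

/* Quicksort whose recursion tail is handed to the in-register routine. */
void simd_tail_quicksort(float *a, size_t n)
{
    if (n <= TAIL_THRESHOLD) { tail_sort(a, n); return; }

    float pivot = a[n / 2];
    size_t i = 0, j = n - 1;
    for (;;) {                       /* Hoare-style partition */
        while (a[i] < pivot) i++;
        while (a[j] > pivot) j--;
        if (i >= j) break;
        float t = a[i]; a[i] = a[j]; a[j] = t;
        i++; j--;
    }
    simd_tail_quicksort(a, j + 1);
    simd_tail_quicksort(a + j + 1, n - j - 1);
}
```

Calling `simd_tail_quicksort(v, n)` sorts `v` in ascending order. The point of the design is that the comparisons in the tail become branch-free `min`/`max` operations on whole registers, which is where a recursive sort otherwise spends much of its time on small, branch-heavy subproblems.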