Fast segmented sort on GPUs

Kaixi Hou; Weifeng Liu; Hao Wang; Wu Chun Feng

Conference Proceedings

Proceedings of the International Conference on Supercomputing (2017) Part F128411

DOI: 10.1145/3079079.3079105

43 Citations

9 Readers

Abstract

Segmented sort, a generalization of classical sort, orders a batch of independent segments within a single array. With the wider adoption of manycore processors for HPC and big data applications, segmented sort plays an increasingly more important role than classical sort. In this paper, we present an adaptive segmented sort mechanism on GPUs. Our mechanism includes two core techniques: (1) a differentiated method for different segment lengths to eliminate the irregularity caused by varying workloads and thread divergence; and (2) a register-based sort method to support N-to-M data-thread binding and in-register data communication. We also implement a shared memory-based merge method to support merging chunks of non-uniform length via multiple warps. Our segmented sort mechanism shows substantial improvements over the methods from CUB, CUSP and ModernGPU on NVIDIA K80-Kepler and TitanX-Pascal GPUs. Furthermore, we apply our mechanism to two applications, i.e., suffix array construction and sparse matrix-matrix multiplication, and obtain clear gains over state-of-the-art implementations.
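To make the abstract's terminology concrete, the following is a minimal CPU reference sketch in Python, not the paper's GPU implementation. It shows what a segmented sort computes (each segment ordered independently inside one array) and one plausible way to group segments by length so that each group could be dispatched to a differently tuned kernel. The function names, the offsets-array layout, and the length thresholds (1, 32, 256) are assumptions made for illustration; the abstract does not specify them.

```python
from bisect import bisect_left
from typing import List, Sequence


def segmented_sort(keys: Sequence[int], seg_offsets: Sequence[int]) -> List[int]:
    """Sort every segment of `keys` independently (sequential reference only).

    Assumed layout: seg_offsets[i] is the start index of segment i, and the
    last segment ends at len(keys).
    """
    out = list(keys)
    bounds = list(seg_offsets) + [len(out)]
    for start, end in zip(bounds, bounds[1:]):
        out[start:end] = sorted(out[start:end])
    return out


def bin_segments_by_length(
    seg_offsets: Sequence[int],
    n_keys: int,
    thresholds: Sequence[int] = (1, 32, 256),  # hypothetical cut-offs
) -> List[List[int]]:
    """Group segment ids into length classes.

    Bin k holds segments whose length is <= thresholds[k] and greater than
    thresholds[k - 1]; the final bin holds everything longer than the last
    threshold. A GPU implementation could launch one kernel per bin.
    """
    bounds = list(seg_offsets) + [n_keys]
    bins: List[List[int]] = [[] for _ in range(len(thresholds) + 1)]
    for seg_id, (start, end) in enumerate(zip(bounds, bounds[1:])):
        bins[bisect_left(thresholds, end - start)].append(seg_id)
    return bins


keys = [4, 2, 9, 7, 3, 1, 8, 5]
offsets = [0, 3, 4]  # segments: [4, 2, 9], [7], [3, 1, 8, 5]
sorted_keys = segmented_sort(keys, offsets)
length_bins = bin_segments_by_length(offsets, len(keys))
```

Here `sorted_keys` is `[2, 4, 9, 7, 1, 3, 5, 8]` and `length_bins` is `[[1], [0, 2], [], []]`: the single-element segment lands in the first bin, and the two short segments share the second. Binning by length is only one way to read the "differentiated method" described above; the register-based sort and shared-memory merge are GPU-specific and are not modeled in this sketch.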

Citation (APA)

Hou, K., Liu, W., Wang, H., & Feng, W. C. (2017). Fast segmented sort on GPUs. In Proceedings of the International Conference on Supercomputing (Vol. Part F128411). Association for Computing Machinery. https://doi.org/10.1145/3079079.3079105
