Abstract
Sorting is one of the most fundamental algorithms in Computer Science and a common operation in databases, used not only to order query results but also as part of joins (e.g., sort-merge join) and indexing. In this work, we introduce a new type of distribution sort that leverages a learned model of the empirical CDF of the data. Our algorithm uses the model to efficiently approximate the scaled empirical CDF for each record key and map the key to the corresponding position in the output array. We then apply a deterministic sorting algorithm that works well on nearly-sorted arrays (e.g., Insertion Sort) to establish a fully sorted order. We compared this algorithm against common sorting approaches and measured its performance for up to 1 billion normally-distributed double-precision keys. The results show that our approach yields an average 3.38x performance improvement over C++ STL sort, which is an optimized Quicksort hybrid, a 1.49x improvement over sequential Radix Sort, and a 5.54x improvement over a C++ implementation of Timsort, which is the default sorting function for Java and Python.
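The two phases described in the abstract (model-based placement followed by a touch-up sort) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the "model" here is assumed to be a sorted random sample queried by binary search, standing in for the learned CDF model, and the names `fit_cdf_model`, `predict_cdf`, and `learned_sort_sketch` are hypothetical. Collisions are handled with per-position buckets, which is a simplification chosen for brevity.

```python
import bisect
import random


def fit_cdf_model(keys, sample_size=1000, seed=0):
    """Stand-in for a learned CDF model: a sorted random sample of the keys."""
    rng = random.Random(seed)
    return sorted(rng.sample(keys, min(sample_size, len(keys))))


def predict_cdf(sample, key):
    """Approximate the empirical CDF as the fraction of sampled keys <= key."""
    return bisect.bisect_right(sample, key) / len(sample)


def insertion_sort(a):
    """In-place Insertion Sort; cheap when the input is already nearly sorted."""
    for i in range(1, len(a)):
        x = a[i]
        j = i - 1
        while j >= 0 and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x
    return a


def learned_sort_sketch(keys, sample_size=1000):
    """Place each key at its predicted position, then repair local disorder."""
    n = len(keys)
    if n < 2:
        return list(keys)
    sample = fit_cdf_model(keys, sample_size)
    # One bucket per output position; keys predicted to the same position collide.
    buckets = [[] for _ in range(n)]
    for k in keys:
        # Scale the CDF estimate to an output index in [0, n - 1].
        pos = min(n - 1, int(predict_cdf(sample, k) * n))
        buckets[pos].append(k)
    # The CDF estimate is monotone, so only keys within a bucket can be out of order.
    nearly_sorted = [k for bucket in buckets for k in bucket]
    return insertion_sort(nearly_sorted)
```

Because the CDF estimate is monotone in the key, concatenating the buckets yields an array that is disordered only within each bucket, which is the situation where Insertion Sort does little work. A call such as `learned_sort_sketch([random.gauss(0, 1) for _ in range(10_000)])` returns the keys in ascending order; the speedups quoted in the abstract apply to the authors' optimized C++ implementation, not to this sketch.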
Cite
Kristo, A., Vaidya, K., Çetintemel, U., Misra, S., & Kraska, T. (2020). The Case for a Learned Sorting Algorithm. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 1001–1016). Association for Computing Machinery. https://doi.org/10.1145/3318464.3389752