The development of fine-grain multi-threaded program execution models has created an interesting challenge: how to partition a program into threads that can exploit machine parallelism, achieve latency tolerance, and maintain reasonable locality of reference? A successful algorithm must produce a thread partition that best utilizes multiple execution units on a single processing node and handles long and unpredictable latencies. In this paper, we introduce a new thread partitioning algorithm that can meet the above challenge for a range of machine architecture models. A quantitative affinity heuristic is introduced to guide the placement of operations into threads. This heuristic addresses the trade-off between exploiting parallelism and preserving locality. The algorithm is surprisingly simple due to the use of a time-ordered event list to account for the multiple execution unit activities. We have implemented the proposed algorithm and our experiments, performed on a wide range of examples, have demonstrated its efficiency and effectiveness.
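The abstract describes two ingredients: a time-ordered event list that tracks the activity of multiple execution units, and a quantitative affinity heuristic that trades off parallelism against locality. The paper itself defines these precisely; the sketch below is only a hypothetical illustration of how such a partitioner could combine the two ideas. The function name `partition`, the tuple format for operations, and the `affinity_weight` parameter are all assumptions, not the paper's actual formulation.

```python
import heapq

# Hypothetical sketch (not the paper's algorithm): a time-ordered
# event list records when each execution unit next becomes free,
# and a simple affinity score trades off locality (reusing the unit
# that produced an operation's inputs) against parallelism (picking
# the earliest-free unit).

def partition(ops, num_units, affinity_weight=1.0):
    """ops: list of (op_id, duration, inputs) tuples, where inputs
    is the set of op_ids whose results this operation reads.
    Returns a mapping unit_id -> list of op_ids placed on it."""
    # Time-ordered event list: (time_free, unit_id) pairs.
    events = [(0.0, u) for u in range(num_units)]
    heapq.heapify(events)
    placed = {}                                   # op_id -> unit_id
    assignment = {u: [] for u in range(num_units)}
    for op_id, duration, inputs in ops:
        # Examine every unit, score it, and pick the best trade-off.
        candidates = [heapq.heappop(events) for _ in range(num_units)]

        def score(entry):
            time_free, unit = entry
            # Locality credit: inputs already resident on this unit.
            locality = sum(1 for i in inputs if placed.get(i) == unit)
            return time_free - affinity_weight * locality

        best = min(candidates, key=score)
        for c in candidates:
            if c is not best:
                heapq.heappush(events, c)
        time_free, unit = best
        heapq.heappush(events, (time_free + duration, unit))
        placed[op_id] = unit
        assignment[unit].append(op_id)
    return assignment
```

With a large `affinity_weight`, a dependent operation is kept on the unit holding its inputs even if another unit is free sooner; with a small weight, the partitioner favors spreading work across idle units.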
Amaral, J. N., Gao, G., Kocalar, E. D., O’Neill, P., & Tang, X. (2000). Design and implementation of an efficient thread partitioning algorithm. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 1940, pp. 252–259). Springer Verlag. https://doi.org/10.1007/3-540-39999-2_22