In any parallel programming language, collective communication operations involve more than one thread or process and act on multiple streams of data. The language's API provides both algorithmic and run-time system support to optimize the performance of these operations. Some developers, however, choose to be clever and write their own versions of the collective operations starting from the language's primitive operations. The question that always arises is: are these developers wise? In this paper, we examine the case of UPC (Unified Parallel C) and show that in some circumstances it is indeed wiser for developers to optimize starting from UPC's primitive operations. In our tests, we found that optimizations built by developers from primitive UPC operations can outperform the readily available UPC collective operations. We pinpoint specific optimizations, at both the algorithmic and the runtime-support levels, that developers could use to uncover missed optimization opportunities. We also propose a novel approach to implementing UPC collective operations across clusters, in which performance-critical components are moved close to the network; we argue that this provides unique advantages for performance improvement. © Springer-Verlag Berlin Heidelberg 2007.
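To make the comparison concrete, the contrast between a library collective and a hand-rolled equivalent can be sketched in UPC itself. The following is a minimal, hypothetical illustration (not the paper's implementation): a broadcast performed with the standard `upc_all_broadcast` collective versus a naive version written from primitives (`upc_memget` plus barriers). The array names and size are assumptions for the example, and the code requires a UPC compiler (e.g., Berkeley UPC or GNU UPC), not a plain C compiler.

```c
/* UPC sketch -- requires a UPC compiler; hypothetical illustration only. */
#include <upc.h>
#include <upc_collective.h>

#define N 1024

shared [N] int src[N];            /* source block, has affinity to thread 0 */
shared [N] int dst[THREADS * N];  /* one block of N ints per thread         */

/* Library version: the standard UPC collective. */
void bcast_collective(void) {
    upc_all_broadcast(dst, src, N * sizeof(int),
                      UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);
}

/* Hand-rolled version from primitives: each thread pulls the data
 * itself with a one-sided bulk copy, bracketed by barriers. */
void bcast_handrolled(void) {
    upc_barrier;
    /* Cast to a private pointer is valid: dst[MYTHREAD * N] has
     * affinity to the calling thread under the [N] block size. */
    upc_memget((int *)&dst[MYTHREAD * N], src, N * sizeof(int));
    upc_barrier;
}
```

The hand-rolled variant is the kind of starting point the paper refers to: once the collective is expressed in primitives, developers can restructure the communication (e.g., tree-shaped forwarding instead of all threads pulling from thread 0) in ways a fixed library implementation may not offer.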
CITATION STYLE
Salama, R. A., & Sameh, A. (2007). UPC collective operations optimization. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4705 LNCS, pp. 536–549). Springer Verlag. https://doi.org/10.1007/978-3-540-74472-6_44