This paper examines the problem of data placement in Bubba, a highly-parallel system for data-intensive applications being developed at MCC. "Highly-parallel" implies that load balancing is a critical performance issue. "Data-intensive" means data is so large that operations should be executed where the data resides. As a result, data placement becomes a critical performance issue. In general, determining the optimal placement of data across processing nodes for performance is a difficult problem. We describe our heuristic approach to solving the data placement problem in Bubba. We then present experimental results using a specific workload to provide insight into the problem. Several researchers have argued the benefits of declustering (i.e., spreading each base relation over many nodes). We show that as declustering is increased, load balancing continues to improve. However, for transactions involving complex joins, further declustering reduces throughput because of communications, startup and termination overhead. We argue that data placement, especially declustering, in a highly-parallel system must be considered early in the design, so that mechanisms can be included for supporting variable declustering, for minimizing the most significant overheads associated with large-scale declustering, and for gathering the required statistics.
CITATION STYLE
Copeland, G., Alexander, W., Boughter, E., & Keller, T. (1988). Data placement in Bubba. In Proceedings of the ACM SIGMOD International Conference on Management of Data (Vol. 1988-June, pp. 99–108). Association for Computing Machinery. https://doi.org/10.1145/50202.50213