Most general-purpose distributed storage systems are not designed with near-data processing (NDP) in mind. They do not respect semantic data boundaries when writing data; for example, a single record may be split across servers. This reduces NDP effectiveness because the data must be collated before computation. While semantic data awareness and NDP functions can be retroactively added to existing distributed storage, doing so is often complex and difficult in practice. We propose sharing storage system layout information with data writers so they can adjust data layouts to prevent data alignment issues regardless of the underlying architecture. Doing so simplifies NDP implementation by reducing the need for data reassembly, and reduces the need for complex storage system or application extensions. We demonstrate a hinting mechanism on both HDFS with computational block storage and an erasure-coded MinIO deployment, reducing data movement by up to 99% when querying CSV data with NDP co-located with the stored data. This was accomplished purely with client-side data alignment, with no modifications to the server-side write paths and no inter-node collation of data.
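To make the idea of client-side alignment concrete, the following is a minimal sketch, not the authors' implementation. It assumes the storage system exposes a single layout hint, here a hypothetical `block_size` in bytes, and that the writer may insert filler bytes between records. The function name `align_records` and the choice of newline as filler (which yields blank lines a CSV reader must skip) are assumptions made for illustration only.

```python
def align_records(records, block_size, pad=b"\n"):
    """Pack whole records into a byte stream so that no record
    straddles a multiple of block_size (the hypothetical layout hint).

    records: iterable of bytes objects, each one complete record.
    pad:     single filler byte; newline is an assumption that suits
             line-oriented data such as CSV.
    """
    if len(pad) != 1:
        raise ValueError("pad must be exactly one byte")
    out = bytearray()
    for rec in records:
        if len(rec) > block_size:
            # A record larger than a block cannot be kept whole.
            raise ValueError("record does not fit in one block")
        used = len(out) % block_size
        if used + len(rec) > block_size:
            # Fill the rest of the current block, start a new one.
            out += pad * (block_size - used)
        out += rec
    return bytes(out)


def split_blocks(data, block_size):
    """Cut the stream the way a layout-unaware store would: by offset."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


# Example: three CSV rows and an assumed 24-byte block.
rows = [b"1,alice,10\n", b"2,bob,20\n", b"3,carol,30\n"]
aligned = align_records(rows, block_size=24)
blocks = split_blocks(aligned, 24)
```

With this layout, each block holds only complete rows, so a filter running next to one block can parse it without fetching bytes from another server. The cost is the space spent on filler, which grows as records approach the block size.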
Adams, I. F., Agrawal, N., & Mesnier, M. P. (2021). Enabling near-data processing in distributed object storage systems. In HotStorage 2021 - Proceedings of the 13th ACM Workshop on Hot Topics in Storage and File Systems (pp. 28–34). Association for Computing Machinery, Inc. https://doi.org/10.1145/3465332.3470881