In this paper, we propose Strark-H, a Spark-based storage and query strategy for large-scale spatial data that improves the response time of spatial queries by taking into account both the spatial location and the category keywords of spatial objects. First, we define a custom InputFormat class so that Spark can natively read Shapefile, a common file format for storing spatial data. Then, we put forward a partitioning and indexing method for spatial storage, in which spatial data is partitioned unevenly according to spatial position; this ensures that the size of each partition does not exceed the HDFS block size and preserves the spatial proximity of spatial objects within the cluster. Moreover, a secondary index is generated, comprising a global index based on the spatial position of all partitions and a local index based on the category of spatial objects. Finally, we design a new data loading and query scheme based on Strark-H for spatial queries, including range queries, k-NN queries, and spatial join queries. Extensive experiments on OpenStreetMap (OSM) data show that Strark-H enables Spark to natively support spatial storage and spatial queries with good efficiency and scalability.
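The partitioning and indexing idea summarized above can be illustrated with a minimal sketch in plain Python rather than Spark. It is not the authors' implementation. The abstract does not state the splitting rule, so recursive quadtree-style splitting is assumed here, and an object-count `capacity` stands in for the HDFS block-size bound. All names (`Partition`, `build_partitions`, `range_query`) are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Partition:
    # Bounding box (xmin, ymin, xmax, ymax); the list of these boxes
    # plays the role of the global index.
    bbox: tuple
    # Local index: category keyword -> list of (x, y) points.
    by_category: dict = field(default_factory=lambda: defaultdict(list))


def build_partitions(points, bbox, capacity, max_depth=16):
    """Split bbox into quadrants until each cell holds <= capacity points.

    Assumption: quadtree splitting; `capacity` is a stand-in for the
    HDFS block-size limit. `points` is a list of (x, y, category).
    """
    if not points:
        return []  # empty cells are not stored
    if len(points) <= capacity or max_depth == 0:
        part = Partition(bbox)
        for x, y, cat in points:
            part.by_category[cat].append((x, y))
        return [part]
    xmin, ymin, xmax, ymax = bbox
    xm, ym = (xmin + xmax) / 2, (ymin + ymax) / 2
    quads = [(xmin, ymin, xm, ym), (xm, ymin, xmax, ym),
             (xmin, ym, xm, ymax), (xm, ym, xmax, ymax)]
    buckets = [[], [], [], []]
    for x, y, cat in points:
        # Quadrant number: bit 0 = right half, bit 1 = upper half.
        buckets[(x >= xm) + 2 * (y >= ym)].append((x, y, cat))
    parts = []
    for quad, bucket in zip(quads, buckets):
        parts.extend(build_partitions(bucket, quad, capacity, max_depth - 1))
    return parts


def range_query(partitions, window, category):
    """Return points of `category` inside window (xmin, ymin, xmax, ymax)."""
    wx0, wy0, wx1, wy1 = window
    hits = []
    for part in partitions:
        x0, y0, x1, y1 = part.bbox
        # Global index step: skip partitions that cannot overlap the window.
        if x1 < wx0 or x0 > wx1 or y1 < wy0 or y0 > wy1:
            continue
        # Local index step: only inspect objects of the requested category.
        for x, y in part.by_category.get(category, []):
            if wx0 <= x <= wx1 and wy0 <= y <= wy1:
                hits.append((x, y))
    return hits
```

Dense regions are split more finely than sparse ones, which is one way to obtain the uneven, size-bounded partitions described above. The per-partition category map lets a keyword-constrained range query skip objects of other categories.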
CITATION STYLE
Zou, W., Jing, W., Chen, G., & Lu, Y. (2020). Strark-H: A Strategy for Spatial Data Storage to Improve Query Efficiency Based on Spark. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11944 LNCS, pp. 285–299). Springer. https://doi.org/10.1007/978-3-030-38991-8_19