Data indexing is commonly used in data mining when working with high-dimensional, large-scale data sets. Hadoop, an open-source cloud computing project that implements the MapReduce framework in Java, has attracted significant interest for distributed data mining. To address the problems of globalization, random writes, and duration in Hadoop, a data indexing approach using the Java Persistence API (JPA) is elaborated through the implementation of a KD-tree algorithm on Hadoop. An improved intersection algorithm for distributed data indexing on Hadoop is also proposed; it runs in O(M + log N) time and is well suited to cases involving multiple intersections. We evaluate the data indexing algorithms on an open dataset and a synthetic dataset in a modest cloud environment. The results show that the algorithms are feasible for large-scale data mining. © 2010 IFIP.
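The abstract does not spell out the intersection procedure, but one common way to approach O(M + log N) behavior when intersecting a small sorted query list (size M) against a large sorted index (size N) is to pay for a single binary search to find the starting position and then finish with a linear merge. The Java sketch below illustrates that idea only; the class name SortedIntersection and the sample data are illustrative assumptions, and this is not presented as the authors' exact algorithm.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public final class SortedIntersection {

    // Intersects a small sorted query list (size M) with a large sorted index
    // array (size N). One binary search (O(log N)) locates the first relevant
    // position in the index; a linear merge from there costs O(M) plus the
    // stretch of the index actually covered by the query keys, so for
    // clustered keys the overall cost behaves like O(M + log N).
    public static List<Long> intersect(long[] query, long[] index) {
        List<Long> result = new ArrayList<>();
        if (query.length == 0 || index.length == 0) {
            return result;
        }
        // Binary search for the first query key inside the large index.
        int pos = Arrays.binarySearch(index, query[0]);
        if (pos < 0) {
            pos = -pos - 1; // insertion point when the key is absent
        }
        // Linear merge of the two sorted sequences from that position on.
        int q = 0;
        while (q < query.length && pos < index.length) {
            if (query[q] == index[pos]) {
                result.add(query[q]);
                q++;
                pos++;
            } else if (query[q] < index[pos]) {
                q++;
            } else {
                pos++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long[] query = {42, 43, 47};
        long[] index = {1, 5, 9, 13, 42, 43, 44, 47, 90, 120};
        System.out.println(intersect(query, index)); // prints [42, 43, 47]
    }
}

Compared with running a separate binary search for every query key (O(M log N)), this variant pays the logarithmic cost only once and then streams through both sequences, which is one reason a design of this kind suits repeated intersections over clustered key ranges.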
Lai, Y., & Zhongzhi, S. (2010). An efficient data indexing approach on Hadoop using Java persistence API. In IFIP Advances in Information and Communication Technology (Vol. 340 AICT, pp. 213–224). Springer Science and Business Media, LLC. https://doi.org/10.1007/978-3-642-16327-2_27