A Novel Approach for Parallelized Clustering Model by Using Hadoop Map-Reduce Framework

Abstract

The current generation of cellular devices has plenty of processing power and storage, but lags behind in software for storing and processing huge volumes of data. "Big Data" refers to the huge volumes of unstructured data generated by high-performance applications, from scientific computing to social networks, e-government to medical information systems, and so on. According to recent big data studies, the amount of data continues to grow at an exponential rate, and a strong computing paradigm such as Hadoop with Map-Reduce is required to process it on computing clusters. The Map-Reduce framework is a programming model that allows terabytes of data to be handled in a fraction of the time. Big data also needs good scheduling to attain strong results: a scheduling technique is used to reduce starvation, improve resource utilization, and distribute work to the available resources. Various scheduling algorithms have been developed for the Hadoop Map-Reduce model, differing widely in design, behavior, and the issues they address. Existing resource-allocation schedulers do not consider the weight of each job, which leads to unbalanced performance among nodes. We focus on several issues of Hadoop Map-Reduce and introduce a novel mechanism to handle and process big data. The proposed approach is divided into three phases: implementation of an optimized clustering scheme; improvement of that scheme by implementing it in a parallel manner; and, finally, incorporation of machine learning and fuzzy-logic-based intelligent techniques that adapt to changes in the data so that huge datasets can be processed efficiently.

In the first phase, we develop an improved P-DBSCAN algorithm whose mapper and reducer programs assign the input data points to clusters. The approach consists of several stages: data partitioning, local clustering, data merging, and, as the final outcome, global cluster generation. Two cluster optimization methods are also included to improve clustering performance. DBSCAN parallelization options are examined on the Spark platform, which is designed for in-memory and iterative processing; single-node Spark and Spark cluster platforms have historically used distinct resource managers for DBSCAN optimization.

In the second phase, we adopt the concept of hierarchical clustering and develop a novel clustering scheme to overcome the issues of DBSCAN. The scheme uses mapper and reducer entities: each mapper receives its copy of the input dataset and carries out labeling activities, such as finding the closest centroid for an individual data object, while the reducer creates a new value for each centroid from the items allocated to it in the current iteration by taking the average of all data objects in that cluster. A K-Medoid method is then proposed that uses medoids as cluster centers, since mean-based centers can be unduly influenced by outliers; it aims to minimize the cost between each cluster's non-medoid objects and its medoid.

Finally, we focus on extracting semantic relationships between queries, adopting a neuro-fuzzy hybrid technique together with the TF-IDF algorithm. We develop a hybrid approach that uses fuzzy logic and a neural network to mine high fuzzy utility patterns.
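The mapper/reducer interplay described in the second phase (label each point with its nearest centroid, then recompute each centroid as the mean of its assigned points) can be illustrated with a minimal Python sketch. The toy data, centroid initialization, and function names below are illustrative assumptions, not taken from the paper, and a real deployment would run the map and reduce steps on a Hadoop or Spark cluster rather than in-process.

```python
from collections import defaultdict

def mapper(points, centroids):
    """Label each point with the index of its closest centroid (map step)."""
    for p in points:
        # squared Euclidean distance to every current centroid
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        yield min(range(len(centroids)), key=lambda i: dists[i]), p

def reducer(labelled_points):
    """Recompute every centroid as the mean of the points assigned to it (reduce step)."""
    groups = defaultdict(list)
    for label, p in labelled_points:
        groups[label].append(p)
    return {label: tuple(sum(dim) / len(pts) for dim in zip(*pts))
            for label, pts in groups.items()}

# Toy 2-D data with two obvious groups (illustrative only).
points = [(1.0, 1.0), (1.2, 0.8), (9.0, 9.0), (8.8, 9.2)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
for _ in range(3):  # repeat the map/reduce cycle a few times
    centroids = list(reducer(mapper(points, centroids)).values())
print(centroids)  # approximate cluster centres
```

The K-Medoid variant mentioned above would differ only in the reduce step, choosing the existing data object with the lowest total cost to the rest of its cluster instead of the arithmetic mean.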
New fuzzy rules are defined based on six criteria: Query Time, Query Length, Query Expiry, Total Queries, CPU Usage, and Task Activity. First, data pre-processing is performed, in which several tasks filter the data; the queries are then processed through the scheduler, where the fuzzy rule base is applied and semantic relationships are established among the queries. Weights are computed from the filtered data using the TF-IDF approach. The average runtime of these schemes is 167.5 s, 137.5 s, 51.25 s, 37.5 s, and 25 s for Apriori (M), HFUPM, Apriori, EFUPM, and the proposed approach, respectively.
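The abstract does not specify which TF-IDF variant is used for the query weights, so the sketch below uses the standard term-frequency times inverse-document-frequency formulation over a toy query log; the example queries and function name are assumptions for illustration only.

```python
import math
from collections import Counter

def tf_idf(queries):
    """Compute TF-IDF term weights for each (already filtered) query."""
    docs = [q.lower().split() for q in queries]
    n = len(docs)
    # document frequency: in how many queries each term appears
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({term: (count / len(doc)) * math.log(n / df[term])
                        for term, count in tf.items()})
    return weights

# Toy queries standing in for the filtered query log (illustrative only).
queries = ["hadoop map reduce job", "spark cluster job", "fuzzy rule scheduler"]
for q, w in zip(queries, tf_idf(queries)):
    print(q, "->", {t: round(v, 3) for t, v in w.items()})
```

Terms shared by many queries (such as "job" above) receive lower weights, which is what lets the scheduler relate queries by their more distinctive terms.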

Citation (APA)

Maithri, C., & Chandramouli, H. (2022). A Novel Approach for Parallelized Clustering Model by Using Hadoop Map-Reduce Framework. In 2022 IEEE International Conference for Women in Innovation, Technology and Entrepreneurship, ICWITE 2022 - Proceedings. Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICWITE57052.2022.10176209
