The proposed system addresses the challenges posed by large databases, data imbalance, heterogeneity, and multidimensionality through a novel classification model based on progressive sampling. Sampling is used to improve processing performance and to work within memory restrictions. The random forest regressor feature importance technique with the Gini significance method identifies the most important attributes, reducing the number of features used for classification. The system employs diverse classifiers, including random forest, ensemble learning, support vector machine (SVM), k-nearest neighbors (KNN), and logistic regression, allowing flexibility in handling different data types and achieving high classification accuracy. By iteratively applying progressive sampling to the dataset restricted to the best features, the proposed technique aims to improve performance significantly compared with using the entire dataset, focusing computational resources on the most informative subsets of the data and reducing time complexity. Results show that the system achieves over 85% accuracy with only 5-10% of the original data size, providing accurate predictions while reducing data processing requirements. In conclusion, the proposed system combines progressive sampling, feature selection using random forest regressor feature importance (RFRFI-PS), and a range of classifiers to address the challenges of large databases and improve classification accuracy, demonstrating promising results in both accuracy and time-complexity reduction.
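The paper does not give pseudocode for the progressive-sampling loop, but the idea it describes (train on increasingly large random samples and stop once accuracy plateaus) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the geometric doubling schedule, the `tol` convergence threshold, and the `train_and_score` callback are all assumptions introduced here for clarity.

```python
import random

def progressive_sample_sizes(n_total, n0=100, factor=2):
    # Geometric schedule of sample sizes: n0, 2*n0, 4*n0, ..., capped at n_total.
    # (The doubling factor is an assumption; the paper does not fix a schedule.)
    sizes = []
    n = n0
    while n < n_total:
        sizes.append(n)
        n *= factor
    sizes.append(n_total)
    return sizes

def progressive_sampling(data, train_and_score, tol=0.005):
    # Train a classifier on progressively larger random samples and stop as
    # soon as the accuracy gain between consecutive rounds falls below `tol`,
    # so most of the data never needs to be processed.
    prev_acc = 0.0
    for size in progressive_sample_sizes(len(data)):
        sample = random.sample(data, size)
        acc = train_and_score(sample)  # caller trains on `sample`, returns accuracy
        if acc - prev_acc < tol:
            return size, acc  # converged: larger samples add little
        prev_acc = acc
    return len(data), prev_acc
```

In practice `train_and_score` would fit one of the classifiers named above (e.g. random forest or SVM) on the sampled rows, restricted to the features selected by the feature-importance step, and return held-out accuracy.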
Citation:
Bangera, N., Kayarvizhy, Luharuka, S., & Manek, A. S. (2024). Improving time efficiency in big data through progressive sampling-based classification model. Indonesian Journal of Electrical Engineering and Computer Science, 33(1), 248–260. https://doi.org/10.11591/ijeecs.v33.i1.pp248-260