Intrusion detection is an essential task for protecting the cyber environment from attacks. Many studies have proposed sophisticated models to detect intrusions from large amounts of data, yet they ignore the fact that poor data quality has a direct impact on the performance of intrusion detection systems. Examples of poor data quality include mislabeled, inaccurate, incomplete, irrelevant, inconsistent, duplicated, and overlapped data. To investigate how data quality affects machine learning performance, we conducted a series of experiments on 11 host-based intrusion datasets using eight machine learning (ML) models and two pre-trained language models, BERT and GPT-2. The experimental results showed:

1. BERT and GPT-2 outperformed the other models on every dataset.
2. Duplicated and overlapped data in a dataset affected the pre-trained models and the classic ML models differently: the pre-trained models were less susceptible to duplicates and overlaps than the classic ML models.
3. Removing overlaps and duplicates from training data with a normal range of sequence similarities improved the pre-trained models' performance on most datasets; however, it could degrade performance on datasets with highly similar sequences.
4. The reliability of model evaluation could be compromised when the test data contained duplicates.
5. The overlap rate between the normal class and the intrusion class appeared inversely related to the performance of the pre-trained models in intrusion detection.

Given these results, we propose a framework for model selection and data quality assurance for building a high-quality machine learning-based intrusion detection system.
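The paper does not reproduce its curation code here, but the two data-quality checks the abstract describes, duplicate removal within a class and overlap measurement between the normal and intrusion classes, can be sketched as follows. This is a minimal illustration on toy system-call sequences, assuming sequences are compared by exact match; the function names and data are hypothetical, not from the paper.

```python
def duplicate_rate(sequences):
    """Fraction of sequences that exactly duplicate an earlier one."""
    seen, dupes = set(), 0
    for seq in sequences:
        key = tuple(seq)
        if key in seen:
            dupes += 1
        seen.add(key)
    return dupes / len(sequences) if sequences else 0.0

def overlap_rate(normal, intrusion):
    """Fraction of unique sequences appearing in BOTH classes."""
    n = {tuple(s) for s in normal}
    i = {tuple(s) for s in intrusion}
    union = n | i
    return len(n & i) / len(union) if union else 0.0

def deduplicate(sequences):
    """Keep the first occurrence of each sequence, preserving order."""
    seen, out = set(), []
    for seq in sequences:
        key = tuple(seq)
        if key not in seen:
            seen.add(key)
            out.append(seq)
    return out

# Toy host-based traces (hypothetical system-call sequences).
normal = [["open", "read", "close"], ["open", "read", "close"], ["open", "write"]]
intrusion = [["open", "write"], ["exec", "fork"]]

print(duplicate_rate(normal))           # 1 of 3 traces is a duplicate
print(overlap_rate(normal, intrusion))  # ["open", "write"] is shared
print(deduplicate(normal))
```

Per findings 3 and 4 above, such deduplication would be applied to the training split (with caution on highly similar datasets), and duplicate checks on the test split guard against inflated evaluation scores.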
Tran, N., Chen, H., Bhuyan, J., & Ding, J. (2022). Data Curation and Quality Evaluation for Machine Learning-Based Cyber Intrusion Detection. IEEE Access, 10, 121900–121923. https://doi.org/10.1109/ACCESS.2022.3211313