Debiasing Android Malware Datasets: How Can I Trust Your Results If Your Dataset Is Biased?

Abstract

Android security, and malware investigation in particular, has received a lot of attention over the last decade. Researchers attempt to highlight the security-relevant characteristics of applications in order to better understand malware and to distinguish it effectively from benign applications. The accuracy and completeness of their proposals are evaluated experimentally on malware and goodware datasets. The quality of these datasets is therefore of critical importance: if the datasets are outdated or not representative of the studied population, the conclusions may be flawed. We specify different types of experimental scenarios. Some of them require unlabeled datasets that are representative of the entire population. Others require datasets labeled with valuable characteristics that may be difficult to compute, such as malware datasets. We discuss the irregularities of the datasets used in experiments, which call into question the validity of the performance reported in the literature. This article focuses on providing guidelines for designing debiased datasets. First, we propose guidelines for building representative datasets from unlabeled ones. Second, we propose and evaluate a debiasing algorithm that, given a biased labeled dataset and a target representative dataset, builds a dataset that is both representative and labeled. Finally, from these debiased datasets, we produce datasets for experiments on Android malware detection or classification with machine learning algorithms. Experiments show that classification with machine learning algorithms performs better on debiased datasets.
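
The abstract does not describe how the debiasing algorithm works, so the following is only an illustrative sketch of the general idea and not the authors' method. It swaps in plain stratified resampling: items are drawn from a biased labeled dataset so that the distribution of one categorical feature follows that of a target representative dataset. The function name `debias_by_resampling`, the `year` feature, and the toy data are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def debias_by_resampling(labeled, target, key, size, seed=0):
    """Draw about `size` items from `labeled` so that the distribution of
    key(item) follows the distribution observed in `target`.

    Hypothetical helper: simple stratified resampling on one feature,
    not the algorithm from the article.
    """
    rng = random.Random(seed)

    # Distribution of the feature in the representative (unlabeled) dataset.
    target_counts = Counter(key(item) for item in target)
    total = sum(target_counts.values())

    # Group the labeled items by feature value.
    buckets = defaultdict(list)
    for item in labeled:
        buckets[key(item)].append(item)

    sample = []
    for value, count in target_counts.items():
        quota = round(size * count / total)
        pool = buckets.get(value, [])
        # Sample without replacement; a class that is too rare in the
        # labeled dataset contributes only what is available.
        sample.extend(rng.sample(pool, min(quota, len(pool))))
    return sample


# Toy data (assumed): the labeled dataset over-represents old applications.
labeled = [{"id": i, "year": 2015, "label": "malware"} for i in range(80)]
labeled += [{"id": 80 + i, "year": 2020, "label": "malware"} for i in range(20)]

# The target dataset is balanced across the two years.
target = [{"year": 2015}] * 50 + [{"year": 2020}] * 50

debiased = debias_by_resampling(labeled, target, key=lambda a: a["year"], size=20)
year_counts = Counter(app["year"] for app in debiased)
```

In this toy run the labeled dataset is 80% applications from 2015, while the resampled subset follows the balanced target distribution. A real dataset would involve several features at once and classes with too few labeled samples, which this sketch handles only by truncating the quota.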

Citation (APA)

Miranda, T. C., Gimenez, P. F., Lalande, J. F., Tong, V. V. T., & Wilke, P. (2022). Debiasing Android Malware Datasets: How Can I Trust Your Results If Your Dataset Is Biased? IEEE Transactions on Information Forensics and Security, 17, 2182–2197. https://doi.org/10.1109/TIFS.2022.3180184
