Debiasing Android Malware Datasets: How Can I Trust Your Results If Your Dataset Is Biased?

Tomas Concepcion Miranda; Pierre Francois Gimenez; Jean Francois Lalande; Valerie Viet Triem Tong; Pierre Wilke

Journal Article, Open Access

IEEE Transactions on Information Forensics and Security (2022) 17 2182-2197

DOI: 10.1109/TIFS.2022.3180184

9 Citations

26 Readers

Abstract

Android security has received considerable attention over the last decade, especially malware investigation. Researchers attempt to highlight applications' security-relevant characteristics to better understand malware and to distinguish it effectively from benign applications. The accuracy and completeness of their proposals are evaluated experimentally on malware and goodware datasets. The quality of these datasets is therefore critical: if the datasets are outdated or not representative of the studied population, the conclusions may be flawed. We specify different types of experimental scenarios. Some require unlabeled datasets that are representative of the entire population. Others require datasets labeled with valuable characteristics that may be difficult to compute, such as malware datasets. We discuss the irregularities of the datasets used in experiments, which call into question the validity of the performance reported in the literature. This article focuses on providing guidelines for designing debiased datasets. First, we propose guidelines for building representative datasets from unlabeled ones. Second, we propose and evaluate a debiasing algorithm that, given a biased labeled dataset and a target representative dataset, builds a dataset that is both representative and labeled. Finally, from these debiased datasets, we produce datasets for experiments on Android malware detection or classification with machine learning algorithms. Experiments show that machine learning algorithms classify better when they use debiased datasets.
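The abstract does not describe how the debiasing algorithm works, so the following is only a minimal illustrative sketch of the general idea (deriving a labeled dataset whose distribution follows a representative target), not the authors' method. It assumes a single categorical characteristic per sample, such as a release year, and uses plain stratified subsampling. The function name `debias_by_resampling` and the `key` parameter are hypothetical.

```python
from collections import Counter, defaultdict
from fractions import Fraction
import random


def debias_by_resampling(labeled, target, key, seed=0):
    """Subsample `labeled` so that its distribution of `key` matches `target`.

    Illustrative assumption: bias is measured on one categorical
    characteristic returned by `key` (hypothetical, e.g. release year).
    """
    # Count how often each category occurs in the representative target.
    target_counts = Counter(key(item) for item in target)

    # Group the biased labeled samples by the same characteristic.
    buckets = defaultdict(list)
    for item in labeled:
        buckets[key(item)].append(item)

    # Largest scale factor such that no category needs more labeled
    # samples than are available (exact arithmetic avoids rounding drift).
    scale = min(Fraction(len(buckets[cat]), count)
                for cat, count in target_counts.items())

    rng = random.Random(seed)
    result = []
    for cat, count in target_counts.items():
        quota = int(scale * count)  # samples to keep for this category
        result.extend(rng.sample(buckets[cat], quota))
    return result
```

With a labeled set that holds 80 samples from 2015 and 20 from 2020, and a target split evenly between the two years, the sketch keeps 20 samples per year. A category that is absent from the labeled set forces an empty result, which makes visible the limit of pure subsampling when the labeled data does not cover the target population.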

Cite

Miranda, T. C., Gimenez, P. F., Lalande, J. F., Tong, V. V. T., & Wilke, P. (2022). Debiasing Android Malware Datasets: How Can I Trust Your Results If Your Dataset Is Biased? IEEE Transactions on Information Forensics and Security, 17, 2182–2197. https://doi.org/10.1109/TIFS.2022.3180184
