Malware detection on highly imbalanced data through sequence modeling

Rajvardhan Oak; Min Du; David Yan; Harshvardhan Takawale; Idan Amit

Conference Proceedings

Proceedings of the ACM Conference on Computer and Communications Security (2019) 37-48

DOI: 10.1145/3338501.3357374

78 Citations

96 Readers

Abstract

We explore the task of Android malware detection based on dynamic analysis of application activity sequences using deep learning techniques. We show that analyzing a sequence of activities is informative for detecting malware, but that analyzing longer sequences does not necessarily lead to a more accurate model. In real-world scenarios, malware is rare compared to harmless applications. Our dataset has more than 180,000 samples, two-thirds of which are malware. This dataset is significantly larger than the datasets used in previous studies. We mimic real-world cases by randomly sampling a small portion of the malware samples. Using the state-of-the-art model BERT, we show that it is possible to achieve the desired malware detection performance with an extremely unbalanced dataset. We find that our BERT-based model achieves an F1 score of 0.919 with just 0.5% of the examples being malware, which significantly outperforms current state-of-the-art approaches. The results validate the effectiveness of our proposed method in dealing with highly imbalanced datasets.

Cite

Oak, R., Du, M., Yan, D., Takawale, H., & Amit, I. (2019). Malware detection on highly imbalanced data through sequence modeling. In Proceedings of the ACM Conference on Computer and Communications Security (pp. 37–48). Association for Computing Machinery. https://doi.org/10.1145/3338501.3357374
