Learning latent byte-level feature representation for malware detection

Mahmood Yousefi-Azar; Len Hamey; Vijay Varadharajan; Shiping Chen

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2018) 11304 LNCS 568-578

DOI: 10.1007/978-3-030-04212-7_50

6 Citations

8 Readers

Abstract

This paper proposes two byte-level feature representations of binary files for malware detection. The proposed static feature representations do not need any third-party tools and are independent of the operating system because they operate on the raw file bytes. Sparse term-frequency simhashing (s-tf-simhashing) is a faster variant of tf-simhashing: it requires less computation and outperforms the original dense tf-simhashing. The binary word2vec (Bword2vec) representation embeds the semantic relationships of the n-grams into the code vectors. Bword2vec uses a binary-to-word2vec representation that yields a lower-dimensional feature space than s-tf-simhashing, further reducing the computation of the classifier. We show that the proposed techniques can be used successfully to analyze both full malware apps and infected files. The experiments are conducted on real Android and PDF malware datasets.
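
To make the idea of a sparse, hash-based projection of byte n-gram term frequencies more concrete, the following is a minimal sketch only. It is not the authors' s-tf-simhashing algorithm, whose details the abstract does not give. The function names, the use of SHA-256 as the hash, the n-gram length n, the output dimension dim, and the number of touched coordinates k are all assumptions chosen for illustration.

```python
import hashlib
from collections import Counter
from typing import List


def byte_ngram_counts(data: bytes, n: int = 2) -> Counter:
    """Term frequencies of overlapping byte n-grams in a raw file."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def sparse_tf_hash(data: bytes, n: int = 2, dim: int = 64, k: int = 4) -> List[float]:
    """Hypothetical sparse signed projection of n-gram counts.

    Each distinct n-gram adds +tf or -tf to at most k of the dim
    coordinates, so the per-n-gram cost is k rather than dim.
    The hashing scheme here is an assumption, not the paper's.
    """
    if not 1 <= k <= 8:
        raise ValueError("k must be between 1 and 8 (SHA-256 yields 32 bytes)")
    vec = [0.0] * dim
    for gram, tf in byte_ngram_counts(data, n).items():
        digest = hashlib.sha256(gram).digest()
        for j in range(k):
            chunk = digest[4 * j:4 * j + 4]
            # First 3 bytes pick the coordinate, last byte picks the sign.
            idx = int.from_bytes(chunk[:3], "big") % dim
            sign = 1.0 if chunk[3] & 1 else -1.0
            vec[idx] += sign * tf
    return vec


if __name__ == "__main__":
    sample = b"%PDF-1.4 example bytes"  # stand-in for raw file content
    features = sparse_tf_hash(sample)
    print(len(features))
```

The resulting fixed-length vector could be fed to any classifier. Because the sketch reads only raw bytes, it shares the property stated in the abstract of needing no third-party parsing tools.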

Cite

APA

Yousefi-Azar, M., Hamey, L., Varadharajan, V., & Chen, S. (2018). Learning latent byte-level feature representation for malware detection. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11304 LNCS, pp. 568–578). Springer Verlag. https://doi.org/10.1007/978-3-030-04212-7_50
