Using string information for malware family identification

Prasha Shrestha; Suraj Maharjan; Gabriela Ramírez de la Rosa; Alan Sprague; Thamar Solorio; Gary Warner

Journal Article

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2014) 8864 686-697

DOI: 10.1007/978-3-319-12027-0_55

6 Citations

7 Readers

Abstract

Classifying malware into the correct families is an important task for anti-virus vendors. Currently, only some vendors recognize any particular malware sample. Even when they do, they either assign it to different families or use a generic family name, which provides little information. Our method for malware family identification is based on the observation that closely related malware samples share many strings. We first created two kinds of prototypes from the printable strings in the malware: one using term frequency–inverse document frequency (tf-idf) and the other using prominent strings extracted from the vocabulary. We then used these prototypes for classification. We achieved an accuracy of 91.02% when considering the entire vocabulary and 80.52% when considering 20 prominent strings per malware family. This accuracy is high enough for our system to be used to classify even malware that can confuse anti-virus vendors.
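The tf-idf prototype idea can be illustrated with a minimal sketch. The abstract does not give implementation details, so everything below is an assumption made for illustration: the function names, the minimum string length of 4, the `log(N/df) + 1` idf weighting, the summing of sample vectors into one prototype per family, and the use of cosine similarity for classification. The family labels and strings in the usage note are toy data, not taken from the paper.

```python
import math
import re
from collections import Counter, defaultdict


def printable_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII of at least min_len characters.

    Assumption: mimics the Unix `strings` tool; the paper's exact
    extraction rule is not stated in the abstract.
    """
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]


def build_prototypes(samples):
    """Build one tf-idf prototype per family (hypothetical helper).

    samples: list of (family_label, list_of_strings) pairs.
    Returns (prototypes, idf), where prototypes maps each family to a
    term -> weight mapping summed over that family's samples.
    """
    docs = [(family, Counter(strings)) for family, strings in samples]
    n_docs = len(docs)

    # Document frequency: number of samples containing each string.
    doc_freq = Counter()
    for _, tf in docs:
        doc_freq.update(tf.keys())

    # Assumed idf variant: log(N / df) + 1, so no weight is ever zero.
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}

    prototypes = defaultdict(Counter)
    for family, tf in docs:
        for term, count in tf.items():
            prototypes[family][term] += count * idf[term]
    return dict(prototypes), idf


def cosine(a, b):
    """Cosine similarity between two sparse term -> weight mappings."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(strings, prototypes, idf):
    """Assign the family whose prototype is most similar to the sample."""
    tf = Counter(strings)
    # Strings never seen in training carry no weight.
    vec = {term: count * idf[term] for term, count in tf.items() if term in idf}
    return max(prototypes, key=lambda family: cosine(vec, prototypes[family]))
```

As a usage example with toy data, calling `build_prototypes` on a few labelled string lists and then `classify(printable_strings(raw_bytes), prototypes, idf)` returns the label of the closest prototype. Strings shared by every family, such as common DLL names, receive the lowest idf weight and so contribute least to the decision.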

Citation (APA)

Shrestha, P., Maharjan, S., de la Rosa, G. R., Sprague, A., Solorio, T., & Warner, G. (2014). Using string information for malware family identification. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 8864, 686–697. https://doi.org/10.1007/978-3-319-12027-0_55
