Binary code analysis makes it possible to analyze binary code without access to the corresponding source code. A binary, after disassembly, is expressed in an assembly language. This inspires us to approach binary analysis by leveraging ideas and techniques from Natural Language Processing (NLP), a fruitful area focused on processing text in various natural languages. We notice that binary code analysis and NLP share many analogous topics, such as semantics extraction, classification, and code/text comparison. This work thus borrows ideas from NLP to address two important code similarity comparison problems: (I) given a pair of basic blocks from different instruction set architectures (ISAs), determining whether their semantics are similar; and (II) given a piece of code of interest, determining whether it is contained in another piece of code from a different ISA. Solutions to these two problems have many applications, such as cross-architecture vulnerability discovery and code plagiarism detection. Despite the evident importance of Problem I, existing solutions are either inefficient or imprecise. Inspired by Neural Machine Translation (NMT), a recent approach that handles text across natural languages very well, we regard instructions as words and basic blocks as sentences, and propose a novel cross-(assembly)-lingual deep learning approach to Problem I that attains high efficiency and precision. Many solutions have been proposed to determine whether two pieces of code, e.g., functions, are equivalent (the equivalence problem), which differs from Problem II (the containment problem). Resolving the cross-architecture code containment problem is a new and more challenging endeavor. Employing our technique for cross-architecture basic-block comparison, we propose the first solution to Problem II. We implement a prototype system, INNEREYE, and perform a comprehensive evaluation.
A comparison between our approach and existing approaches to Problem I shows that our system outperforms them in terms of accuracy, efficiency and scalability. The case studies applying the system demonstrate that our solution to Problem II is effective. Moreover, this research showcases how to apply ideas and techniques from NLP to large-scale binary code analysis.
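To make the "instructions as words, basic blocks as sentences" framing and the containment problem more concrete, here is a minimal Python sketch. It is not the INNEREYE implementation. The normalization rule, the `MNEMONIC_CLASS` table, `toy_similarity`, and the use of a longest-common-subsequence (LCS) score for containment are all assumptions made for illustration; the abstract does not specify any of them. In particular, `toy_similarity` is a hand-written placeholder for the learned cross-architecture block-similarity model.

```python
import re
from typing import Callable, List

Block = List[str]  # one basic block: a "sentence" of instruction "words"


def normalize_instruction(ins: str) -> str:
    """Turn one assembly instruction into a single 'word'.

    Hypothetical rule: numeric constants become <IMM> so the vocabulary
    stays small; register names are kept unchanged.
    """
    ins = ins.strip().lower()
    ins = re.sub(r"\b0x[0-9a-f]+\b|\b\d+\b", "<IMM>", ins)
    return re.sub(r"[\s,]+", "_", ins)


# Toy alias table (hypothetical): a few x86 and ARM mnemonics mapped to
# shared classes. A learned model would replace this entirely.
MNEMONIC_CLASS = {
    "mov": "move", "ldr": "load", "str": "store",
    "add": "arith", "sub": "arith",
    "cmp": "compare", "call": "call", "bl": "call",
}


def toy_similarity(a: Block, b: Block) -> float:
    """Jaccard overlap of mnemonic classes (placeholder for a learned model)."""
    def classes(block: Block) -> set:
        mnemonics = (word.split("_")[0] for word in block)
        return {MNEMONIC_CLASS.get(m, m) for m in mnemonics}

    ca, cb = classes(a), classes(b)
    union = ca | cb
    return len(ca & cb) / len(union) if union else 0.0


def containment_score(
    query: List[Block],
    target: List[Block],
    similar: Callable[[Block, Block], float],
    threshold: float = 0.5,
) -> float:
    """Fraction of query blocks matched, in order, by target blocks (LCS)."""
    m, n = len(query), len(target)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if similar(query[i - 1], target[j - 1]) >= threshold:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n] / m if m else 0.0


def to_blocks(raw: List[List[str]]) -> List[Block]:
    """Normalize every instruction of every block."""
    return [[normalize_instruction(i) for i in block] for block in raw]


# Made-up example data: an x86 query and an ARM target block sequence.
x86_query = to_blocks([["add eax, 1", "cmp eax, 10"], ["call foo"]])
arm_target = to_blocks([
    ["mov r0, #0"],
    ["add r0, r0, #1", "cmp r0, #10"],
    ["ldr r1, [sp, #4]"],
    ["bl foo"],
])

print(containment_score(x86_query, arm_target, toy_similarity))
```

With this made-up data, both query blocks find an in-order match in the target, so the score is 1.0; swapping the arguments gives 0.5, which shows that containment, unlike equivalence, is asymmetric. Passing the similarity function as a parameter keeps the block comparison (Problem I) separate from the containment search (Problem II), so a trained model could be substituted without changing the search.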
CITATION
Zuo, F., Li, X., Young, P., Luo, L., Zeng, Q., & Zhang, Z. (2019). Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs. In 26th Annual Network and Distributed System Security Symposium, NDSS 2019. The Internet Society. https://doi.org/10.14722/ndss.2019.23492