BlockMatch: A Fine-Grained Binary Code Similarity Detection Approach Using Contrastive Learning for Basic Block Matching

Citations: 1
Readers: 10 (Mendeley users who have this article in their library)
Abstract

Binary code similarity detection (BCSD) plays a vital role in computer security and software engineering. Traditional BCSD methods rely heavily on specific features and require rich expert knowledge, making them sensitive to code alterations. To improve robustness against minor code alterations, recent research has shifted towards machine learning-based approaches. However, existing BCSD approaches mainly focus on function-level matching and face challenges with large-batch optimization and high-quality sample selection at the basic block level. To overcome these challenges, we propose BlockMatch, a novel fine-grained BCSD approach that leverages natural language processing (NLP) techniques and contrastive learning for basic block matching. We treat the instructions of basic blocks as a language and use a DeBERTa model, which captures relative position relations and contextual semantics, to encode instruction sequences. To handle the diverse operands in binary code, we propose a root operand model pre-training task that mitigates the missing semantics of unseen operands. We then apply a mean pooling layer to generate basic block embeddings for detecting binary code similarity. Additionally, we propose a contrastive training framework, including a block augmentation model that generates high-quality training samples, improving the effectiveness of model training. Inspired by contrastive learning, we adopt the NT-Xent loss as our objective function; it accommodates larger sample sizes during training and mitigates the convergence issues caused by limited local positive/negative samples. Through extensive experiments, we evaluate BlockMatch against state-of-the-art approaches such as PalmTree and SAFE. The results show that BlockMatch achieves a recall@1 of 0.912 at the basic block level in the cross-compiler scenario (pool size = 10), outperforming PalmTree (0.810) and SAFE (0.798). Furthermore, our ablation study shows that the proposed contrastive training framework and root operand model pre-training task help our model achieve superior performance.
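To make the encoding pipeline concrete, here is a minimal sketch of turning a basic block into a single embedding: an encoder produces per-token hidden states, which are averaged over non-padding positions (mean pooling). It uses a stock DeBERTa checkpoint from Hugging Face Transformers purely as a stand-in; the actual BlockMatch model is pre-trained on instruction sequences with its own vocabulary, so the checkpoint name and the whitespace tokenization below are illustrative assumptions, not the authors' artifacts.

```python
import torch
import torch.nn.functional as F
from transformers import DebertaModel, DebertaTokenizerFast

# Stand-in model/tokenizer: a real BlockMatch-style encoder would be
# pre-trained on assembly instructions, not natural-language text.
tokenizer = DebertaTokenizerFast.from_pretrained("microsoft/deberta-base")
encoder = DebertaModel.from_pretrained("microsoft/deberta-base")

def embed_block(instructions):
    """Encode a basic block (a list of instruction strings) into one
    fixed-size embedding via masked mean pooling of token states."""
    text = " ".join(instructions)
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state       # (1, seq_len, hidden)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (1, seq_len, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    return F.normalize(pooled, dim=-1)                    # unit-length embedding

# Example: two blocks that should embed similarly after training.
e1 = embed_block(["mov eax, 1", "add eax, ebx", "ret"])
e2 = embed_block(["mov eax, 1", "add eax, ecx", "ret"])
print(float(e1 @ e2.t()))  # cosine similarity of the two block embeddings
```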
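The NT-Xent objective named in the abstract is the standard normalized temperature-scaled cross-entropy loss from contrastive learning (as popularized by SimCLR): each block embedding is pulled toward its augmented positive while every other sample in the batch acts as a negative, which is why larger batches supply more training signal. A self-contained PyTorch sketch, with the temperature value chosen for illustration:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.07):
    """NT-Xent loss over a batch of positive pairs.

    z1, z2: (N, d) embeddings of two views of the same N basic blocks;
    row i of z1 and row i of z2 form a positive pair, and all other
    rows in the concatenated batch serve as in-batch negatives.
    """
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, d), unit norm
    sim = z @ z.t() / temperature                       # scaled cosine sims
    n = z1.size(0)
    # A sample must never be compared against itself.
    sim.fill_diagonal_(float("-inf"))
    # For row i < n the positive sits at row i + n, and vice versa.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```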
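The reported recall@1 metric asks, for each query block, whether the top-ranked candidate in its pool (here of size 10) is the true match. A small evaluation sketch, under the illustrative assumption that index 0 of each pool holds the ground-truth counterpart (the paper's actual pool construction may differ):

```python
import torch
import torch.nn.functional as F

def recall_at_1(query_emb, pool_emb):
    """recall@1 for block retrieval.

    query_emb: (Q, d) embeddings of query blocks
    pool_emb:  (Q, P, d) candidate pools; pool_emb[i, 0] is assumed to
               be the ground-truth match for query i (pool size P = 10)
    """
    q = F.normalize(query_emb, dim=-1).unsqueeze(1)  # (Q, 1, d)
    p = F.normalize(pool_emb, dim=-1)                # (Q, P, d)
    sims = (q * p).sum(dim=-1)                       # (Q, P) cosine sims
    hits = sims.argmax(dim=1) == 0                   # is rank 1 the match?
    return hits.float().mean().item()
```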

Cite

APA: Luo, Z., Wang, P., Xie, W., Zhou, X., & Wang, B. (2023). BlockMatch: A Fine-Grained Binary Code Similarity Detection Approach Using Contrastive Learning for Basic Block Matching. Applied Sciences (Switzerland), 13(23). https://doi.org/10.3390/app132312751
