Research on Compressed Input Sequences Based on Compiler Tokenization

Abstract

Current applications of large language models (LLMs) in code intelligence suffer from low tokenization efficiency. Source code inputs are split into longer token sequences than necessary, which wastes the model's limited context. In addition, existing LLM tokenization technology struggles to ensure the contextual synonymity of variables. To address these problems, we propose a compiler-based method for compressing input sequences. The input statements are first tokenized by the compiler’s lexical analyzer, and the result is then tokenized and filtered by the large model’s tokenizer. This yields shorter, semantically clearer, and higher-quality embedded token sequences. A contextual dictionary is then used to restore the reduced tokens to their original form in the output statements. The experimental results show that the compressed input sequence method runs smoothly in code generation scenarios. Compared to the baseline model, the compiler-based tokenization method reduces the input token count by 33.7%. This study provides new insights for applying LLMs in the field of code intelligence.
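The abstract does not give implementation details, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes Python source as input, uses the standard library `tokenize` module as a stand-in for the compiler's lexical analyzer, and assumes that the compression step amounts to replacing identifiers with short placeholders recorded in a dictionary. The function names `compress` and `restore` and the placeholder scheme `v0`, `v1`, ... are hypothetical. The LLM tokenizer and filtering stage is omitted.

```python
import io
import keyword
import tokenize


def compress(source):
    """Lex the source and swap each distinct identifier for a short placeholder.

    Returns the rewritten source and a dictionary mapping each placeholder
    back to the original identifier (the "contextual dictionary" in this sketch).
    """
    mapping = {}  # original identifier -> placeholder (assumed naming scheme)
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        text = tok.string
        # Rename every NAME token that is not a hard keyword. This also
        # renames builtins such as print, a simplification of this sketch.
        if tok.type == tokenize.NAME and not keyword.iskeyword(text):
            text = mapping.setdefault(text, f"v{len(mapping)}")
        out.append((tok.type, text))
    restore_map = {short: orig for orig, short in mapping.items()}
    return tokenize.untokenize(out), restore_map


def restore(compressed, restore_map):
    """Map placeholders in lexed text back to the original identifiers."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(compressed).readline):
        text = tok.string
        if tok.type == tokenize.NAME:
            text = restore_map.get(text, text)
        out.append((tok.type, text))
    return tokenize.untokenize(out)
```

Because the same identifier always maps to the same placeholder, each variable is represented consistently across the input. `tokenize.untokenize` does not preserve the original spacing when given only token types and strings, so the restored text matches the original token for token rather than character for character.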

Cite

Li, Z., & Lu, X. (2025). Research on Compressed Input Sequences Based on Compiler Tokenization. Information (Switzerland), 16(2). https://doi.org/10.3390/info16020073
