Research on Compressed Input Sequences Based on Compiler Tokenization

Zhe Li; Xinxi Lu

Journal Article, Open Access

Information (Switzerland) (2025) 16(2)

DOI: 10.3390/info16020073

Abstract

Current applications of large language models (LLMs) in code intelligence suffer from low tokenization efficiency. Source code inputs are split into longer token sequences than necessary, which wastes the model's limited context. In addition, existing LLM tokenization techniques struggle to preserve the contextual synonymity of variables. To address these problems, we propose a compiler-based method for compressing input sequences. The method first tokenizes the input statements with the compiler's lexical analyzer, then tokenizes and filters the result with the large model's tokenizer. This yields shorter, semantically clearer, and higher-quality embedded token sequences. A contextual dictionary is then used to restore the reduced tokens to their original form in the output statements. Experimental results show that the compressed input sequence method runs smoothly in code generation scenarios. Compared with the baseline model, compiler-based tokenization reduces the input token count by 33.7%. This study provides new insights for applying LLMs in code intelligence.
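
The abstract does not spell out the pipeline, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes Python source as input, uses the standard-library `tokenize` module as a stand-in for the compiler's lexical analyzer, and assumes that "reduction" means replacing identifiers with short placeholders. The function names `compress` and `restore` and the `v0`, `v1`, ... placeholder scheme are hypothetical; the LLM tokenizer and filtering stage is omitted.

```python
import io
import keyword
import tokenize


def compress(source: str) -> tuple[str, dict[str, str]]:
    """Replace every non-keyword name with a short placeholder.

    Returns the rewritten source and a dictionary that maps each
    placeholder back to the original name (the "contextual dictionary"
    in this sketch). Assumption: the source does not already use names
    of the form v0, v1, ...; a real system would need to check this.
    """
    forward: dict[str, str] = {}  # original name -> placeholder
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        text = tok.string
        if tok.type == tokenize.NAME and not keyword.iskeyword(text):
            # The same name always maps to the same placeholder.
            text = forward.setdefault(text, f"v{len(forward)}")
        out.append((tok.type, text))
    table = {short: name for name, short in forward.items()}
    # With 2-tuples, untokenize regenerates spacing itself; the result
    # tokenizes to the same token stream but is not formatted identically.
    return tokenize.untokenize(out), table


def restore(compressed: str, table: dict[str, str]) -> str:
    """Map placeholders in model output back to the original names."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(compressed).readline):
        text = tok.string
        if tok.type == tokenize.NAME:
            text = table.get(text, text)
        out.append((tok.type, text))
    return tokenize.untokenize(out)


if __name__ == "__main__":
    src = "total_price = unit_price * quantity\nprint(total_price)\n"
    short, table = compress(src)
    print(short)
    print(table)
    print(restore(short, table))
```

Because every occurrence of a name maps to the same placeholder, each variable is represented consistently wherever it appears. Whether this shortens the sequence seen by a particular LLM tokenizer depends on that tokenizer's vocabulary, which this sketch does not measure.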

Cite

Li, Z., & Lu, X. (2025). Research on Compressed Input Sequences Based on Compiler Tokenization. Information (Switzerland), 16(2). https://doi.org/10.3390/info16020073
