The quadratic complexity of the attention module causes it to gradually dominate the compute of Transformer-based LLMs during generation. Moreover, the excessive key-value cache that arises when processing long inputs imposes a severe burden on memory footprint and inference latency. In this work, we propose a plug-and-play approach that incrementally compresses the intermediate activations of a specified span of tokens into compact ones, thereby reducing both the memory and computational cost of processing subsequent context. Experiments on both in-domain language modeling and zero-shot open-ended document generation demonstrate the advantage of our approach over sparse attention baselines in terms of fluency, n-gram matching, and semantic similarity. Finally, we comprehensively profile the benefit of context compression for improving system throughput. Code is available at https://github.com/DRSY/KV_Compression.
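The abstract describes compressing the cached activations of a token span into a compact representation so later tokens attend over a shorter cache. As a minimal sketch of that idea only: the abstract does not specify the compression operator, so the mean pooling and the helper name compress_kv_span below are illustrative assumptions, not the paper's learned sentinel-token mechanism.

```python
import torch

def compress_kv_span(keys, values, start, end):
    """Collapse the cached KV entries of tokens [start, end) into one slot.

    keys, values: (batch, heads, seq_len, head_dim) cached attention states.
    Mean pooling is a placeholder; the paper instead trains the model so
    designated tokens absorb the span's information before it is dropped.
    """
    span_k = keys[:, :, start:end].mean(dim=2, keepdim=True)    # (B, H, 1, D)
    span_v = values[:, :, start:end].mean(dim=2, keepdim=True)  # (B, H, 1, D)
    keys = torch.cat([keys[:, :, :start], span_k, keys[:, :, end:]], dim=2)
    values = torch.cat([values[:, :, :start], span_v, values[:, :, end:]], dim=2)
    return keys, values

# Toy usage: compress tokens 4..12 of a 16-token cache down to one slot.
B, H, T, D = 1, 8, 16, 64
k, v = torch.randn(B, H, T, D), torch.randn(B, H, T, D)
k2, v2 = compress_kv_span(k, v, 4, 12)
print(k2.shape)  # torch.Size([1, 8, 9, 64]): 16 - 8 + 1 entries remain
```

Under this sketch, every subsequent decoding step attends over 9 cached entries instead of 16, which is the source of the memory and compute savings the abstract claims.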
Ren, S., Jia, Q., & Zhu, K. Q. (2023). Context Compression for Auto-regressive Transformers with Sentinel Tokens. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023) (pp. 12860–12867). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.794