Self-Distilled Quantization: Achieving High Compression Rates in Transformer-Based Language Models

Citations: 0
Mendeley readers: 16

Abstract

We investigate the effects of post-training quantization and quantization-aware training on the generalization of Transformer language models. We present a new method called self-distilled quantization (SDQ) that minimizes accumulative quantization errors and outperforms baselines. We apply SDQ to the multilingual models XLM-R-Base and InfoXLM-Base and demonstrate that both models can be reduced from 32-bit floating point weights to 8-bit integer weights while maintaining a high level of performance on the XGLUE benchmark. Our results also highlight the challenges of quantizing multilingual models, which must generalize to languages they were not fine-tuned on.
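The abstract describes SDQ only at a high level: quantization-aware training in which the full-precision network acts as its own distillation teacher for a simulated 8-bit forward pass. The sketch below illustrates that general idea in PyTorch; the helper names (fake_quantize, sdq_style_step), the loss weighting alpha, and the temperature T are illustrative assumptions, not the paper's exact objective or implementation, and the model is assumed to be a classifier returning logits.

import torch
import torch.nn.functional as F

def fake_quantize(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    # Uniform affine quantize-dequantize (simulated INT8) of a weight tensor.
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (w.max() - w.min()).clamp(min=1e-8) / (qmax - qmin)
    zero_point = torch.round(-w.min() / scale)
    q = torch.clamp(torch.round(w / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale

def sdq_style_step(model, optimizer, batch, labels, alpha: float = 0.5, T: float = 2.0):
    # One illustrative training step: the FP32 model distills into its own
    # fake-quantized forward pass, and the gradient computed at the quantized
    # weights is applied to the FP32 weights (straight-through estimator).

    # 1) Teacher pass with the original full-precision weights (no gradient).
    with torch.no_grad():
        teacher_logits = model(batch)

    # 2) Temporarily swap in fake-quantized copies of the weight matrices.
    fp_weights = {n: p.data.clone() for n, p in model.named_parameters() if p.dim() > 1}
    for n, p in model.named_parameters():
        if n in fp_weights:
            p.data = fake_quantize(p.data)

    # 3) Student pass: task loss plus a distillation loss toward the FP32 teacher.
    student_logits = model(batch)
    task_loss = F.cross_entropy(student_logits, labels)
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    loss = (1.0 - alpha) * task_loss + alpha * kd_loss

    # 4) Backpropagate while the quantized weights are in place, then restore
    #    full precision and apply the update to the FP32 weights.
    optimizer.zero_grad()
    loss.backward()
    for n, p in model.named_parameters():
        if n in fp_weights:
            p.data = fp_weights[n]
    optimizer.step()
    return loss.item()

In this sketch, distillation is applied only to the output logits; a method that targets accumulative quantization error, as the abstract suggests, could additionally match intermediate-layer outputs between the full-precision and quantized passes.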

Citation (APA)
O’Neill, J., & Dutta, S. (2023). Self-Distilled Quantization: Achieving High Compression Rates in Transformer-Based Language Models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 2, pp. 1329–1339). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.acl-short.114
