Differentiable subset pruning of transformer heads

Citations: 33
Readers (Mendeley): 67

Abstract

Multi-head attention, a collection of several attention mechanisms that independently attend to different parts of the input, is the key ingredient in the Transformer. Recent work has shown, however, that a large proportion of the heads in a Transformer's multi-head attention mechanism can be safely pruned away without significantly harming the performance of the model; such pruning leads to models that are noticeably smaller and faster in practice. Our work introduces a new head pruning technique that we term differentiable subset pruning. Intuitively, our method learns per-head importance variables and then enforces a user-specified hard constraint on the number of unpruned heads. The importance variables are learned via stochastic gradient descent. We conduct experiments on natural language inference and machine translation; we show that differentiable subset pruning performs comparably or better than previous works while offering precise control of the sparsity level.
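To make the idea concrete, the sketch below shows one way the abstract's recipe could look in code: a learnable importance logit per attention head, a hard top-k mask that enforces the user-specified number of unpruned heads, and a straight-through relaxation so the logits remain trainable by SGD. This is a hypothetical illustration, not the authors' implementation; the class name, the sigmoid relaxation, and the straight-through estimator are assumptions made for clarity.

```python
# Hypothetical sketch (not the paper's released code): per-head importance
# logits plus a hard top-k mask, trained with a straight-through estimator.
import torch
import torch.nn as nn


class HeadGate(nn.Module):
    """Gates num_heads attention heads, keeping exactly k of them unpruned."""

    def __init__(self, num_heads: int, k: int):
        super().__init__()
        self.k = k
        self.logits = nn.Parameter(torch.zeros(num_heads))  # per-head importance

    def forward(self) -> torch.Tensor:
        soft = torch.sigmoid(self.logits)               # relaxed gates in (0, 1)
        keep = torch.topk(self.logits, self.k).indices  # indices of the k best heads
        hard = torch.zeros_like(soft)
        hard[keep] = 1.0                                # exact, user-specified sparsity
        # Straight-through: the forward pass uses the hard 0/1 mask,
        # while gradients flow back through the soft gates.
        return hard + soft - soft.detach()


gate = HeadGate(num_heads=12, k=4)
mask = gate()  # shape (12,): exactly four ones in the forward pass
# Inside a Transformer layer, each head's output would be scaled by its gate,
# e.g. head_outputs * mask.view(1, 1, -1, 1), before the output projection.
```

Because the mask is exactly k-sparse in the forward pass, the pruned model's sparsity level is controlled precisely, which is the property the abstract emphasizes over relaxation-only pruning schemes.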


Citation (APA)

Li, J., Cotterell, R., & Sachan, M. (2021). Differentiable subset pruning of transformer heads. Transactions of the Association for Computational Linguistics, 9, 1442–1459. https://doi.org/10.1162/tacl_a_00436

Readers over time

[Chart: annual Mendeley reader counts, 2021–2025]

Readers' Seniority

PhD / Postgrad / Masters / Doc: 10 (59%)
Researcher: 4 (24%)
Lecturer / Postdoc: 2 (12%)
Professor / Associate Prof.: 1 (6%)

Readers' Discipline

Computer Science: 19 (79%)
Linguistics: 3 (13%)
Neuroscience: 1 (4%)
Engineering: 1 (4%)
