SHAQ: Single Headed Attention with Quasi-recurrence

Abstract

Natural Language Processing research has recently been dominated by large-scale transformer models. Although they achieve state-of-the-art results on many important language tasks, transformers often require expensive compute resources and days to weeks of training. This is feasible for researchers at big tech companies and leading research universities, but not for scrappy start-up founders, students, and independent researchers. Stephen Merity’s SHA-RNN, a compact hybrid attention-RNN model, is designed for consumer-grade modeling: it requires significantly fewer parameters and less training time to reach near state-of-the-art results. We analyze Merity’s model through an exploratory study of several units of its architecture, assessing both training time and overall quality. Ultimately, we combine these findings into a new architecture, which we call SHAQ: Single Headed Attention Quasi-recurrent Neural Network. With our new architecture we achieve accuracy similar to the SHA-RNN while training roughly 4x faster.
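For readers unfamiliar with the combination the name describes, the following is a minimal, hypothetical PyTorch-style sketch of a block that pairs a QRNN-style quasi-recurrent layer with a single attention head. It is not the authors' implementation; all module, class, and parameter names are illustrative assumptions, and the layer sizes are toy values.

# Minimal illustrative sketch (NOT the authors' code) of the SHAQ idea:
# a quasi-recurrent (QRNN-style) layer followed by a single attention head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuasiRecurrentLayer(nn.Module):
    """QRNN-style layer: a causal 1-D convolution produces candidate,
    forget, and output gates, followed by fo-pooling over time."""
    def __init__(self, d_model, kernel_size=2):
        super().__init__()
        self.kernel_size = kernel_size
        # One convolution emits candidate (z), forget (f), and output (o) gates.
        self.conv = nn.Conv1d(d_model, 3 * d_model, kernel_size)

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        # Left-pad so the convolution is causal (no access to future tokens).
        x = F.pad(x.transpose(1, 2), (self.kernel_size - 1, 0))
        z, f, o = self.conv(x).chunk(3, dim=1)   # each: (batch, d_model, seq_len)
        z, f, o = torch.tanh(z), torch.sigmoid(f), torch.sigmoid(o)
        # fo-pooling: c_t = f_t * c_{t-1} + (1 - f_t) * z_t,  h_t = o_t * c_t
        c = torch.zeros_like(z[..., 0])
        hs = []
        for t in range(z.size(-1)):
            c = f[..., t] * c + (1 - f[..., t]) * z[..., t]
            hs.append(o[..., t] * c)
        return torch.stack(hs, dim=1)            # (batch, seq_len, d_model)

class SingleHeadAttention(nn.Module):
    """Scaled dot-product attention with exactly one head."""
    def __init__(self, d_model):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / (x.size(-1) ** 0.5)
        # Causal mask: each position attends only to itself and the past.
        mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v

class SHAQBlock(nn.Module):
    """One block in the spirit of SHAQ: quasi-recurrence, then a single
    attention head, each with a residual connection."""
    def __init__(self, d_model):
        super().__init__()
        self.qrnn = QuasiRecurrentLayer(d_model)
        self.attn = SingleHeadAttention(d_model)

    def forward(self, x):
        x = x + self.qrnn(x)
        return x + self.attn(x)

# Example usage with toy dimensions.
block = SHAQBlock(d_model=64)
out = block(torch.randn(2, 16, 64))              # -> (batch=2, seq_len=16, d_model=64)

The fo-pooling loop is written sequentially for clarity; the appeal of quasi-recurrence is that the gate computations are convolutions that parallelize across the sequence, which is consistent with the training-speed gains the abstract reports.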

Citation (APA)

Dandona, S., Kushner, W., Bharwani, N., & Schreiber, B. (2022). SHAQ: Single Headed Attention with Quasi-recurrence. In Lecture Notes in Networks and Systems (Vol. 507, pp. 548–563). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-10464-0_37
