Adaptive semiparametric language models

Citations: 58
Mendeley readers: 127

Abstract

We present a language model that combines a large parametric neural network (i.e., a transformer) with a non-parametric episodic memory component in an integrated architecture. Our model uses extended short-term context by caching local hidden states—similar to transformer-XL—and global long-term memory by retrieving a set of nearest neighbor tokens at each timestep. We design a gating function to adaptively combine multiple information sources to make a prediction. This mechanism allows the model to use either local context, short-term memory, or long-term memory (or any combination of them) on an ad hoc basis depending on the context. Experiments on word-based and character-based language modeling datasets demonstrate the efficacy of our proposed method compared to strong baselines.
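The central mechanism described in the abstract is a learned gate that mixes predictive distributions from the local context, the cached short-term memory, and the retrieved long-term memory. Below is a minimal, hypothetical sketch of such a gate in PyTorch; the class and argument names (AdaptiveGate, p_local, p_short, p_long) are illustrative assumptions and do not come from the paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveGate(nn.Module):
    """Illustrative gate that adaptively mixes three next-token
    distributions: local context, cached short-term memory, and
    retrieved long-term memory. A sketch, not the authors' code."""

    def __init__(self, hidden_dim: int, num_sources: int = 3):
        super().__init__()
        # Maps the current hidden state to mixture weights over the sources.
        self.gate_net = nn.Linear(hidden_dim, num_sources)

    def forward(self, hidden, p_local, p_short, p_long):
        # hidden:  (batch, hidden_dim) current decoder state
        # p_*:     (batch, vocab) probability distributions from each source
        weights = F.softmax(self.gate_net(hidden), dim=-1)  # (batch, 3)
        mixed = (
            weights[:, 0:1] * p_local
            + weights[:, 1:2] * p_short
            + weights[:, 2:3] * p_long
        )
        return mixed  # context-dependent mixture over the vocabulary


# Usage sketch with toy shapes.
if __name__ == "__main__":
    batch, hidden_dim, vocab = 2, 16, 100
    gate = AdaptiveGate(hidden_dim)
    h = torch.randn(batch, hidden_dim)
    dists = [F.softmax(torch.randn(batch, vocab), dim=-1) for _ in range(3)]
    p = gate(h, *dists)
    print(p.shape)  # torch.Size([2, 100]); rows sum to 1
```

Because the mixture weights are a function of the current hidden state, the model can lean on local context, short-term cache, or long-term retrieval on a per-timestep basis, which is the behavior the abstract describes.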

Cite (APA)

Yogatama, D., de Masson d'Autume, C., & Kong, L. (2021). Adaptive semiparametric language models. Transactions of the Association for Computational Linguistics, 9, 362–373. https://doi.org/10.1162/tacl_a_00371
