Abstract
This paper asks what predictive processing models of brain function can learn from the success of transformer architectures. We suggest that transformers have been successful because they implicitly commit to a non-Markovian generative model, in which memory is needed to contextualize current observations and to make predictions about the future. Interestingly, both working memory in cognitive science and transformer architectures rely heavily upon the concept of attention. We argue that the move beyond Markov is crucial in constructing generative models capable of dealing with much of the sequential data (and certainly language) that our brains contend with. We characterize two broad approaches to this problem, deep temporal hierarchies and autoregressive models, with transformers being an example of the latter. Our key conclusions are that transformers benefit heavily from embedding spaces that place strong metric priors on an implicit latent variable, and that they use this metric to direct a form of attention that highlights the most relevant, and not only the most recent, previous elements in a sequence to help predict the next.
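The idea of attention that favors relevance over recency can be illustrated with a small sketch. This is not the authors' model: it assumes plain scaled dot-product attention with a single head and no learned projections, and the function names and two-dimensional embeddings are hypothetical values chosen for illustration only.

```python
import math


def attention_weights(query, keys):
    """Softmax over scaled dot products between the query and each key.

    The weights depend on similarity in the embedding space,
    not on how recently a key appeared in the sequence.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def summarize_context(history):
    """Use the latest embedding as the query over all earlier embeddings.

    Returns the attention-weighted average of the earlier embeddings
    together with the weights themselves.
    """
    *previous, current = history
    weights = attention_weights(current, previous)
    context = [
        sum(w * vec[i] for w, vec in zip(weights, previous))
        for i in range(len(current))
    ]
    return context, weights


# Hypothetical embeddings: the oldest element resembles the current one,
# while the two most recent previous elements do not.
history = [
    [1.0, 0.0],  # oldest element
    [0.0, 1.0],
    [0.0, 1.0],  # most recent previous element
    [1.0, 0.0],  # current element (the query)
]
context, weights = summarize_context(history)
```

In this toy sequence the largest weight falls on the oldest element, because it lies closest to the query in the embedding space. A first-order Markov model, by contrast, would condition only on the immediately preceding element.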
Citation
Parr, T., Pezzulo, G., & Friston, K. (2025). Beyond Markov: Transformers, memory, and attention. Cognitive Neuroscience. Routledge. https://doi.org/10.1080/17588928.2025.2484485