Multi-scale self-attention for text classification

Abstract

In this paper, we introduce multi-scale structure, a form of prior knowledge, into self-attention modules. We propose a Multi-Scale Transformer that uses multi-scale multi-head self-attention to capture features at different scales. Based on a linguistic perspective and an analysis of a Transformer (BERT) pre-trained on a large corpus, we further design a strategy to control the scale distribution of each layer. Results on three different kinds of tasks (21 datasets) show that our Multi-Scale Transformer consistently and significantly outperforms the standard Transformer on small and moderate-size datasets.
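
The abstract does not spell out the exact formulation, but the core idea, different attention heads operating at different scales, can be illustrated with a minimal sketch. In the sketch below, each head is restricted to a fixed local window of positions (its "scale"), while a head with no window behaves as ordinary global attention; the class name, hyperparameters, and masking scheme are illustrative assumptions, not the authors' reference implementation.

```python
# A minimal sketch of multi-scale (windowed) multi-head self-attention in
# PyTorch. It assumes only the core idea from the abstract: each head
# attends within a fixed local window (its "scale"), and a head with
# scale None stays global. Names and details are illustrative, not the
# authors' reference code.
import math
from typing import List, Optional

import torch
import torch.nn as nn


class MultiScaleSelfAttention(nn.Module):
    def __init__(self, d_model: int, head_scales: List[Optional[int]]):
        super().__init__()
        n_heads = len(head_scales)
        assert d_model % n_heads == 0, "d_model must divide evenly across heads"
        self.d_head = d_model // n_heads
        self.head_scales = head_scales
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, n, _ = x.shape
        h = len(self.head_scales)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq_len, d_head).
        q, k, v = (t.view(b, n, h, self.d_head).transpose(1, 2) for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)

        # Per-head band mask: head i may only attend to tokens within w_i positions.
        pos = torch.arange(n, device=x.device)
        dist = (pos[:, None] - pos[None, :]).abs()  # (seq_len, seq_len)
        masks = [
            torch.zeros(n, n, dtype=torch.bool, device=x.device)
            if w is None else (dist > w)  # True = position blocked for this head
            for w in self.head_scales
        ]
        scores = scores.masked_fill(torch.stack(masks)[None], float("-inf"))

        attn = scores.softmax(dim=-1)
        ctx = (attn @ v).transpose(1, 2).reshape(b, n, h * self.d_head)
        return self.out(ctx)


if __name__ == "__main__":
    # Example scale assignment: three local heads with growing windows plus one global head.
    layer = MultiScaleSelfAttention(d_model=64, head_scales=[1, 3, 8, None])
    tokens = torch.randn(2, 16, 64)
    print(layer(tokens).shape)  # torch.Size([2, 16, 64])
```

The paper's strategy of controlling the scale distribution per layer would, in this sketch, simply amount to passing a different head_scales list to each layer of the model.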

Cite

APA

Guo, Q., Qiu, X., Liu, P., Xue, X., & Zhang, Z. (2020). Multi-scale self-attention for text classification. In AAAI 2020 - 34th AAAI Conference on Artificial Intelligence (pp. 7847–7854). AAAI Press. https://doi.org/10.1609/aaai.v34i05.6290
