In this paper, we introduce multi-scale structure, a form of prior knowledge, into self-attention modules. We propose a Multi-Scale Transformer that uses multi-scale multi-head self-attention to capture features at different scales. Drawing on a linguistic perspective and an analysis of a Transformer pre-trained on a huge corpus (BERT), we further design a strategy to control the scale distribution for each layer. Results on three different kinds of tasks (21 datasets) show that our Multi-Scale Transformer outperforms the standard Transformer consistently and significantly on small and moderate-size datasets.
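As a rough illustration of multi-scale multi-head self-attention, the sketch below gives each head its own "scale". It assumes a scale is a local window half-width, so a head with scale s lets position i attend only to positions j with |i - j| <= s. The abstract does not spell out this definition, so treat it as an assumption. The names `window_mask`, `scaled_head`, and `multi_scale_attention` are hypothetical, and random projections stand in for learned weights. This is not the authors' implementation.

```python
import numpy as np


def window_mask(seq_len, scale):
    """Boolean (seq_len, seq_len) mask: True where |i - j| <= scale."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= scale


def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def scaled_head(x, wq, wk, wv, scale):
    """One attention head restricted to a local window (assumed notion of scale)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Positions outside the window get -inf, so their attention weight is 0.
    scores = np.where(window_mask(x.shape[0], scale), scores, -np.inf)
    weights = softmax(scores)
    return weights @ v, weights


def multi_scale_attention(x, scales, rng):
    """Run one head per entry in `scales` and concatenate the head outputs."""
    seq_len, d_model = x.shape
    n_heads = len(scales)
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by the number of heads")
    d_head = d_model // n_heads
    outputs = []
    for scale in scales:
        # Random projections are placeholders for learned parameters.
        wq, wk, wv = (
            rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
            for _ in range(3)
        )
        out, _ = scaled_head(x, wq, wk, wv, scale)
        outputs.append(out)
    return np.concatenate(outputs, axis=-1)


rng = np.random.default_rng(0)
tokens = rng.standard_normal((6, 8))  # 6 positions, model width 8
mixed = multi_scale_attention(tokens, scales=[1, 1, 3, 5], rng=rng)
```

Here the list `scales=[1, 1, 3, 5]` plays the role of a per-layer scale distribution: two heads look only at immediate neighbours, while the others see progressively wider context. The specific values are illustrative and not taken from the paper.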
Guo, Q., Qiu, X., Liu, P., Xue, X., & Zhang, Z. (2020). Multi-scale self-attention for text classification. In AAAI 2020 - 34th AAAI Conference on Artificial Intelligence (pp. 7847–7854). AAAI Press. https://doi.org/10.1609/aaai.v34i05.6290