A Vision Transformer Based Scene Text Recognizer with Multi-grained Encoding and Decoding


Abstract

Recently, the vision Transformer (ViT) has attracted increasing attention, and many works have introduced it into concrete vision tasks with impressive performance. However, only a few works have focused on applying the ViT to scene text recognition. This paper takes a further step and proposes a strong scene text recognizer with a fully ViT-based architecture. Specifically, we introduce multi-grained features into both the encoder and the decoder. For the encoder, we adopt a two-stage ViT with patches of different granularity: the first stage extracts extensive visual features with 2D fine-grained patches, and the second stage extracts a sequence of contextual features with 1D coarse-grained patches. The decoder integrates Connectionist Temporal Classification (CTC)-based and attention-based decoding; the two decoding schemes introduce features of different granularity into the decoder and benefit from each other through deep interaction. To improve the extraction of fine-grained features, we additionally explore self-supervised learning for text recognition with masked autoencoders. Furthermore, we propose a focusing mechanism that lets the model target pixel reconstruction of the text area. Our method achieves state-of-the-art or comparable accuracy on scene text recognition benchmarks, with faster inference and nearly 50% fewer parameters than other recent works.
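To make the two patch granularities and the CTC branch concrete, here is a minimal NumPy sketch. It is not the authors' implementation. The image size (32 x 128), the patch sizes, and the function names `patchify_2d`, `patchify_1d`, and `ctc_greedy_decode` are all assumptions chosen for illustration. The 1D patches are cut from the raw image only to show the token layout, since the abstract does not say what the second stage takes as input. The decoder shown is the standard greedy CTC rule (collapse repeats, then drop blanks), not the paper's combined CTC and attention decoder.

```python
import numpy as np


def patchify_2d(image, patch_h, patch_w):
    """Split an (H, W, C) image into non-overlapping 2D patches.

    Returns an array of shape (num_patches, patch_h * patch_w * C),
    ordered row by row, one flattened patch per token.
    """
    h, w, c = image.shape
    assert h % patch_h == 0 and w % patch_w == 0, "image must tile evenly"
    gh, gw = h // patch_h, w // patch_w
    patches = image.reshape(gh, patch_h, gw, patch_w, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch_h, patch_w, c)
    return patches.reshape(gh * gw, patch_h * patch_w * c)


def patchify_1d(image, patch_w):
    """Cut full-height vertical strips: one token per horizontal position."""
    return patchify_2d(image, image.shape[0], patch_w)


def ctc_greedy_decode(logits, blank=0):
    """Standard greedy CTC decoding on (T, num_classes) scores."""
    best = logits.argmax(axis=1)
    decoded, prev = [], None
    for k in best:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if k != prev and k != blank:
            decoded.append(int(k))
        prev = k
    return decoded


# Hypothetical 32 x 128 RGB text image filled with deterministic values.
image = np.arange(32 * 128 * 3, dtype=np.float32).reshape(32, 128, 3)

fine = patchify_2d(image, 4, 4)   # 2D fine-grained tokens
coarse = patchify_1d(image, 4)    # 1D coarse-grained tokens (vertical strips)
print(fine.shape, coarse.shape)

# Frame-wise argmax path 1 1 _ 1 2 2 _ (with _ as blank).
frames = np.eye(3)[[1, 1, 0, 1, 2, 2, 0]]
print(ctc_greedy_decode(frames))
```

With these assumed sizes, the fine-grained grid yields many small tokens covering both image axes, while the coarse-grained strips yield a short left-to-right sequence, which is the layout a sequence decoder such as CTC expects.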

Cite

Qiao, Z., Ji, Z., Yuan, Y., & Bai, J. (2022). A Vision Transformer Based Scene Text Recognizer with Multi-grained Encoding and Decoding. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 13639 LNCS, pp. 198–212). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-21648-0_14
