Recent state-of-the-art scene text recognition methods are primarily based on Recurrent Neural Networks (RNNs); however, these methods require one-dimensional (1D) features, and collapsing the original two-dimensional (2D) image into a 1D sequence discards spatial information, leaving them ill-suited to recognizing irregular-text instances. In this paper, we leverage a Transformer-based architecture for recognizing both regular and irregular text-in-the-wild images. The proposed method pairs the Transformer with a 2D positional encoder, which preserves the spatial information of 2D image features better than previous methods. Experiments on popular benchmarks, including the challenging COCO-Text dataset, demonstrate that the proposed scene text recognition method outperforms the state of the art in most cases, especially on irregular-text recognition.
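To make the idea of a 2D positional encoder concrete, below is a minimal NumPy sketch of one common sinusoidal formulation, in which half of the feature channels encode the row (vertical) position and half encode the column (horizontal) position before the grid of CNN features is fed to the Transformer. This is an illustrative assumption, not the paper's exact encoder; the function names `sinusoid_1d` and `positional_encoding_2d` are hypothetical.

```python
import numpy as np

def sinusoid_1d(positions, dim):
    """Standard 1D sine/cosine encoding (as in Vaswani et al.) for a set of positions."""
    # One frequency per (sin, cos) channel pair.
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))   # (dim/2,)
    angles = positions[:, None] * freqs[None, :]            # (len, dim/2)
    enc = np.zeros((len(positions), dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

def positional_encoding_2d(height, width, dim):
    """2D encoding: half the channels encode the row index, half the column index.
    (Hypothetical helper; the paper's exact formulation may differ.)"""
    assert dim % 4 == 0, "dim must be divisible by 4 (sin/cos pairs per axis)"
    half = dim // 2
    row_enc = sinusoid_1d(np.arange(height), half)          # (H, dim/2)
    col_enc = sinusoid_1d(np.arange(width), half)           # (W, dim/2)
    # Broadcast each axis encoding over the grid and concatenate along channels.
    return np.concatenate(
        [np.broadcast_to(row_enc[:, None, :], (height, width, half)),
         np.broadcast_to(col_enc[None, :, :], (height, width, half))],
        axis=-1,
    )                                                       # (H, W, dim)

# Example: encode an 8x32 feature map with 512-dim features,
# to be added element-wise to the CNN feature grid.
pe = positional_encoding_2d(8, 32, 512)
print(pe.shape)  # (8, 32, 512)
```

Because each grid cell receives a distinct (row, column) code, the Transformer's attention can distinguish spatially separated features even after the grid is flattened into a sequence, which is what a 1D encoding cannot do.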
Raisi, Z., Naiel, M. A., Fieguth, P., Wardell, S., & Zelek, J. (2021). 2D Positional Embedding-based Transformer for Scene Text Recognition. Journal of Computational Vision and Imaging Systems, 6(1), 1–4. https://doi.org/10.15353/jcvis.v6i1.3533