Combining CNN and Transformer as Encoder to Improve End-to-End Handwritten Mathematical Expression Recognition Accuracy

Abstract

Attention-based encoder-decoder (AED) models are increasingly used in handwritten mathematical expression recognition (HMER). Given the recent success of the Transformer in computer vision and the variety of attempts to combine Transformers with convolutional neural networks (CNNs), in this paper we study three ways of combining Transformer and CNN designs to improve AED-based HMER models: 1) the Tandem way, which feeds CNN-extracted features into a Transformer encoder to capture global dependencies; 2) the Parallel way, which adds a Transformer encoder branch that takes raw image patches as input and concatenates its output with the CNN's features to form the final representation; 3) the Mixing way, which replaces the convolution layers in the CNN's last stage with multi-head self-attention (MHSA). We compare these three methods on the CROHME benchmark. On CROHME 2016 and 2019, the Tandem way attains ExpRates of 54.85% and 58.56%, respectively; the Parallel way attains 55.63% and 57.39%; and the Mixing way achieves 53.93% and 55.64%. These results indicate that the Parallel and Tandem ways outperform the Mixing way, with little difference between the two.
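The three encoder variants are only described at a high level in the abstract. As a rough illustration of the Tandem arrangement, the sketch below wires a small CNN backbone into a standard Transformer encoder in PyTorch; the backbone, layer sizes, and module names are assumptions made for illustration, not the authors' actual configuration, and 2-D positional encodings are omitted for brevity.

```python
# Hypothetical sketch of the "Tandem" encoder: CNN features -> Transformer encoder.
# All sizes and the backbone itself are illustrative assumptions, not the paper's setup.
import torch
import torch.nn as nn

class TandemEncoder(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        # Small CNN backbone that downsamples the input image to d_model channels.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Transformer encoder applied to the flattened feature-map positions
        # to capture global dependencies among the CNN-extracted features.
        # (A real model would add 2-D positional encodings before this step.)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, images):  # images: (B, 1, H, W) grayscale expression images
        feats = self.cnn(images)                   # (B, C, H', W')
        tokens = feats.flatten(2).transpose(1, 2)  # (B, H'*W', C) sequence of positions
        return self.transformer(tokens)            # (B, H'*W', C) contextualized features

# Usage example (shapes only):
# enc = TandemEncoder()
# out = enc(torch.randn(2, 1, 128, 384))  # -> torch.Size([2, 768, 256]) for a 16x48 feature map
```

The Parallel way would instead run such a Transformer branch directly on raw image patches and concatenate its output with the CNN features, while the Mixing way would swap the convolutions of the CNN's last stage for MHSA blocks; both follow the same general pattern as the sketch above.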

Citation (APA)

Zhang, Z., & Zhang, Y. (2022). Combining CNN and Transformer as Encoder to Improve End-to-End Handwritten Mathematical Expression Recognition Accuracy. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 13639 LNCS, pp. 185–197). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-21648-0_13
