FNet with Cross-Attention Encoder for Visual Question Answering

Abstract

Visual question answering (VQA) is a challenging research area in which a model must understand the semantics of an image together with the asked question in order to give a correct answer. Recently, transformers have improved the performance of deep learning models compared with traditional sequence-to-sequence models such as LSTMs and RNNs. Transformer models rely on the attention mechanism, and using it in complex models requires long training times and large computational resources. In this paper, the VQA task is accomplished using three transformer encoders in which the transformer's self-attention sub-layers are replaced with Fourier transforms (FNet) that mix the input tokens in the question and image encoders. Self-attention is used only in the cross-modality encoder to enhance accuracy. The experiment is conducted in two phases: first, pre-training on a subset of the LXMERT dataset (5.99% of LXMERT's instances) due to resource limitations, and second, fine-tuning on the VQA v2 dataset. This yields a 24% faster pre-training time, but testing accuracy decreases by 5.61% compared with encoders that use BERT self-attention in all sub-layers. The model is also pre-trained using FNet sub-layers only; it trains 30.6% faster than using only BERT self-attention sub-layers but achieves a lower testing accuracy (48.79%).
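The two building blocks described in the abstract can be illustrated with a short sketch: an FNet-style layer that replaces self-attention with a parameter-free 2D Fourier transform for token mixing inside the question and image encoders, and a standard multi-head cross-attention layer for the cross-modality encoder. The example below is a minimal PyTorch sketch, not the authors' implementation; the module names (FNetMixingLayer, CrossAttentionLayer) and the hidden sizes are illustrative assumptions.

```python
# Minimal sketch of FNet-style token mixing plus cross-modality attention.
# Module names and dimensions are illustrative, not taken from the paper's code.
import torch
import torch.nn as nn


class FNetMixingLayer(nn.Module):
    """Encoder layer where self-attention is replaced by a 2D Fourier transform (FNet-style mixing)."""

    def __init__(self, hidden_dim: int, ffn_dim: int, dropout: float = 0.1):
        super().__init__()
        self.mix_norm = nn.LayerNorm(hidden_dim)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_dim, ffn_dim),
            nn.GELU(),
            nn.Linear(ffn_dim, hidden_dim),
            nn.Dropout(dropout),
        )
        self.ffn_norm = nn.LayerNorm(hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 2D FFT over the sequence and hidden dimensions; keep only the real part.
        mixed = torch.fft.fft2(x).real
        x = self.mix_norm(x + mixed)
        x = self.ffn_norm(x + self.ffn(x))
        return x


class CrossAttentionLayer(nn.Module):
    """Multi-head cross-attention: queries from one modality, keys/values from the other."""

    def __init__(self, hidden_dim: int, num_heads: int = 8, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, dropout=dropout, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(query, context, context)
        return self.norm(query + attended)


# Usage: FNet layers mix tokens within each modality; cross-attention then fuses the modalities.
question = torch.randn(2, 20, 768)   # (batch, question tokens, hidden)
image = torch.randn(2, 36, 768)      # (batch, image regions, hidden)

q_encoder, v_encoder = FNetMixingLayer(768, 3072), FNetMixingLayer(768, 3072)
cross = CrossAttentionLayer(768)

question, image = q_encoder(question), v_encoder(image)
fused = cross(question, image)       # question tokens attend to image features
print(fused.shape)                   # torch.Size([2, 20, 768])
```

Because the Fourier mixing step has no learned parameters, replacing self-attention with it in the single-modality encoders is what makes pre-training faster, while keeping learned attention in the cross-modality encoder preserves most of the accuracy, as reported above.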

Cite

CITATION STYLE

APA

Zekrallah, S. I., Khalifa, N. E. D., & Hassanin, A. E. (2023). FNet with Cross-Attention Encoder for Visual Question Answering. In Lecture Notes on Data Engineering and Communications Technologies (Vol. 152, pp. 602–611). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-20601-6_49
