Visual question answering (VQA) is a challenging research area in which a model must understand the semantics of an image together with the asked question in order to give a correct answer. Recently, transformers have improved the performance of deep learning models over traditional sequence-to-sequence models such as LSTMs and RNNs. Transformer models rely on an attention mechanism, and using attention in complex models requires long training times and large computational resources. In this paper, the VQA task is accomplished using three transformer encoders in which the transformer's self-attention sub-layers are replaced with Fourier transforms (the FNet approach) that mix the input tokens in the question and image encoders. Self-attention is used only in the cross-modality encoder to enhance accuracy. The experiment is conducted in two phases: first, pre-training on a subset of the LXMERT dataset (5.99% of LXMERT's instances) due to resource limitations, and second, fine-tuning on the VQA v2 dataset. This yields a 24% faster pre-training time, but testing accuracy decreases by 5.61% compared with encoders that use BERT self-attention in all sub-layers. The model is also pre-trained using FNet sub-layers only; it trains 30.6% faster than using only BERT self-attention sub-layers but achieves a lower testing accuracy (48.79%).
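The token-mixing idea referenced above can be sketched in a few lines. This is a minimal, illustrative NumPy version of the FNet mixing sub-layer (following Lee-Thorp et al.'s FNet formulation, which this paper builds on): the attention sub-layer is replaced by a parameter-free 2D Fourier transform over the sequence and hidden dimensions, keeping only the real part. It is a sketch of the general technique, not the authors' exact implementation.

```python
import numpy as np

def fnet_mixing(x):
    """FNet token-mixing sub-layer (sketch, not the paper's code).

    Replaces self-attention with a parameter-free 2D discrete Fourier
    transform: one FFT along the hidden dimension, one along the
    sequence dimension. Only the real part is kept so that the rest
    of the network stays real-valued.

    x: array of shape (seq_len, hidden_dim) holding token embeddings.
    """
    # np.fft.fft2 applies the FFT over the last two axes, which here
    # are exactly the sequence and hidden dimensions.
    return np.real(np.fft.fft2(x))

# Toy usage: 4 tokens with hidden size 8; shape is preserved,
# so the sub-layer is a drop-in replacement for self-attention.
tokens = np.random.randn(4, 8)
mixed = fnet_mixing(tokens)
assert mixed.shape == tokens.shape
```

Because the transform has no learned parameters and runs in O(n log n) via the FFT, it is cheaper than the O(n²) pairwise attention it replaces, which is the source of the pre-training speedups reported above.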
Citation:
Zekrallah, S. I., Khalifa, N. E. D., & Hassanin, A. E. (2023). FNet with Cross-Attention Encoder for Visual Question Answering. In Lecture Notes on Data Engineering and Communications Technologies (Vol. 152, pp. 602–611). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-20601-6_49