This paper describes our system for Task 6 of SemEval-2021, which focuses on multimodal propaganda technique classification and aims to classify a given image and text into 22 classes. We propose a transformer-based (Vaswani et al., 2017) architecture to fuse the clues from both image and text. We explore two families of techniques: fine-tuning a text pre-trained transformer extended with visual features, and fine-tuning multimodal pre-trained transformers. For the visual features, we experiment with both grid features extracted from a ResNet (He et al., 2016) network and salient region features from a pre-trained object detector. Among the pre-trained multimodal transformers, we choose ERNIE-ViL (Yu et al., 2020), a two-stream cross-attended transformer model pre-trained on large-scale aligned image-caption data. Fine-tuning ERNIE-ViL for our task yields better performance, owing to the general joint multimodal representation of text and image that ERNIE-ViL learns. In addition, because the distribution of the classification labels is extremely unbalanced, we also experiment with the loss function; the results show that focal loss performs better than cross-entropy loss. Finally, our system ranked first in sub-task C of the final competition.
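To make the loss comparison concrete, here is a minimal sketch of focal loss next to cross-entropy. The abstract does not give the exact formulation used, so this assumes the standard focal loss applied independently to each label's predicted probability; the function names, the focusing parameter `gamma=2.0`, and the example probabilities are illustrative assumptions, not values from the paper.

```python
import math


def binary_focal_loss(probs, targets, gamma=2.0, eps=1e-12):
    """Mean focal loss over independent binary labels (illustrative sketch).

    probs:   predicted probability of the positive class for each label.
    targets: 0/1 ground-truth value for each label.
    gamma:   focusing parameter (assumed value); gamma = 0 recovers
             plain binary cross-entropy.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        # p_t is the probability the model assigns to the true class.
        p_t = p if y == 1 else 1.0 - p
        p_t = min(max(p_t, eps), 1.0)  # clamp to avoid log(0)
        # The (1 - p_t)^gamma factor shrinks the loss of well-classified labels.
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(probs)


def binary_cross_entropy(probs, targets):
    """Binary cross-entropy, expressed as focal loss with gamma = 0."""
    return binary_focal_loss(probs, targets, gamma=0.0)


if __name__ == "__main__":
    easy = ([0.95], [1])  # confident, correct prediction
    hard = ([0.10], [1])  # confident, wrong prediction
    for name, (p, y) in (("easy", easy), ("hard", hard)):
        print(name, binary_cross_entropy(p, y), binary_focal_loss(p, y))
```

With unbalanced labels, most label decisions are easy negatives. The modulating factor scales their contribution down sharply while leaving poorly classified examples nearly untouched, which is the usual motivation for preferring focal loss in this setting.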
Feng, Z., Tang, J., Liu, J., Yin, W., Feng, S., Sun, Y., & Chen, L. (2021). Alpha at SemEval-2021 Task 6: Transformer Based Propaganda Classification. In SemEval 2021 - 15th International Workshop on Semantic Evaluation, Proceedings of the Workshop (pp. 99–104). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.semeval-1.8