Abstract
Given a word in context, the task of Visual Word Sense Disambiguation consists of selecting the correct image among a set of candidates. To select the correct image, we propose a solution blending text augmentation and multimodal models. Text augmentation leverages the fine-grained semantic annotation from WordNet to get a better representation of the textual component. We then compare this sense-augmented text to the set of images using the pre-trained multimodal models CLIP and ViLT. Our system ranked 16th for the English language, achieving a hit rate of 68.5 and a mean reciprocal rank of 79.2. The code for this project is available on GitHub.
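The ranking and scoring steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the two-dimensional vectors are toy stand-ins for the text and image embeddings a model such as CLIP would produce, the function names are hypothetical, and "hit rate" is assumed to mean hit rate at rank 1.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(text_vec, image_vecs):
    """Return candidate image ids, most similar to the text first."""
    return sorted(
        image_vecs,
        key=lambda name: cosine(text_vec, image_vecs[name]),
        reverse=True,
    )


def hit_rate_at_1(rankings, gold):
    """Fraction of instances whose top-ranked image is the gold image."""
    return sum(r[0] == g for r, g in zip(rankings, gold)) / len(gold)


def mean_reciprocal_rank(rankings, gold):
    """Average of 1 / (1-based rank of the gold image) over all instances."""
    return sum(1.0 / (r.index(g) + 1) for r, g in zip(rankings, gold)) / len(gold)


# Toy embeddings (assumed values, not real model outputs).
images = {"a": [0.9, 0.1], "b": [0.0, 1.0], "c": [0.5, 0.5]}
texts = [[1.0, 0.0], [0.0, 1.0]]   # one sense-augmented text vector per instance
gold = ["a", "c"]                  # gold image id per instance

rankings = [rank_candidates(t, images) for t in texts]
print(rankings)
print(hit_rate_at_1(rankings, gold), mean_reciprocal_rank(rankings, gold))
```

In this toy setup the first instance ranks its gold image first and the second ranks it second, so the hit rate is 0.5 and the mean reciprocal rank is 0.75. The abstract reports both metrics as percentages.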
Cite
Zhang, S., Nath, S., & Mazzaccara, D. (2023). GPL at SemEval-2023 Task 1: WordNet and CLIP to Disambiguate Images. In 17th International Workshop on Semantic Evaluation, SemEval 2023 - Proceedings of the Workshop (pp. 1592–1597). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.semeval-1.219