Abstract
In this paper, we present GEM, a General Evaluation benchmark for Multimodal tasks. Unlike existing benchmarks such as GLUE (Wang et al., 2018), SuperGLUE (Wang et al., 2019), XGLUE (Liang et al., 2020), and XTREME (Hu et al., 2020), which focus mainly on natural language tasks, GEM is a large-scale vision-language benchmark consisting of GEM-I for image-language tasks and GEM-V for video-language tasks. Compared with existing multimodal datasets such as MSCOCO (Chen et al., 2015) and Flickr30K (Vinyals et al., 2015) for image-language tasks, and YouCook2 (Zhou et al., 2018) and MSR-VTT (Xu et al., 2016) for video-language tasks, GEM is not only the largest vision-language dataset covering both image-language and video-language tasks, but it is also labeled in multiple languages. We also provide two baseline models for this benchmark. We will release the dataset, code, and baseline models, aiming to advance the development of multilingual multimodal research.
Citation
Su, L., Duan, N., Cui, E., Ji, L., Wu, C., Luo, H., … Sacheti, A. (2021). GEM: A General Evaluation Benchmark for Multimodal Tasks. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (pp. 2594–2603). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.findings-acl.229