Multimodal semantic comprehension, exemplified by tasks such as visual question answering and caption generation, has attracted increasing research interest in recent years. However, due to data limitations, fine-grained semantic comprehension, which requires capturing the semantic details of multimodal content, has not been well investigated. In this work, we introduce “YouMakeup”, a large-scale multimodal instructional video dataset to support fine-grained semantic comprehension research in a specific domain. YouMakeup contains 2,800 videos from YouTube, spanning more than 420 hours in total. Each video is annotated with a sequence of natural language descriptions of instructional steps, grounded in temporal video ranges and spatial facial areas. The annotated steps within a video involve subtle differences in actions, products, and regions, which demand fine-grained understanding and reasoning both temporally and spatially. To evaluate models' fine-grained comprehension abilities, we further propose two groups of tasks, generation tasks and visual question answering tasks, that probe comprehension from different aspects. We also establish a step caption generation baseline for future comparison. The dataset will be publicly available at https://github.com/AIM3-RUC/YouMakeup to support research on fine-grained semantic comprehension.
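To make the annotation structure concrete, the sketch below shows one plausible way to load and inspect per-video step annotations (caption, temporal range, facial areas) as described in the abstract. The field names (video_id, duration, steps, caption, time, areas) and the JSON layout are illustrative assumptions, not the dataset's confirmed schema; consult the release at https://github.com/AIM3-RUC/YouMakeup for the actual format.

```python
import json

def load_annotations(path):
    """Load a list of per-video annotation records from a JSON file.

    Assumed (hypothetical) record layout:
    {"video_id": "...", "duration": 612.0,
     "steps": {"1": {"caption": "Apply foundation on the face",
                     "time": [12.3, 45.6],      # temporal grounding (seconds)
                     "areas": ["face"]}, ...}}  # spatial grounding (facial areas)
    """
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def summarize_video(record):
    """Print each annotated step with its grounded time range and facial areas."""
    print(f"Video {record['video_id']} ({record['duration']}s)")
    for step_id, step in sorted(record["steps"].items(), key=lambda kv: int(kv[0])):
        start, end = step["time"]
        areas = ", ".join(step["areas"])
        print(f"  step {step_id}: [{start:.1f}-{end:.1f}] ({areas}) {step['caption']}")

if __name__ == "__main__":
    # "youmakeup_annotations.json" is a placeholder filename for illustration.
    for record in load_annotations("youmakeup_annotations.json"):
        summarize_video(record)
```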
CITATION STYLE
Wang, W., Wang, Y., Chen, S., & Jin, Q. (2019). YoumakeUp: A large-scale domain-specific multimodal dataset for fine-grained semantic comprehension. In EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference (pp. 5133–5143). Association for Computational Linguistics. https://doi.org/10.18653/v1/d19-1517