Abstract
The goal of this work is to segment, in a video sequence, the objects mentioned in a linguistic description of the scene. We adapt an existing deep neural network that achieves state-of-the-art performance in semi-supervised video object segmentation, adding a linguistic branch that generates an attention map over the video frames and makes the segmentation of the objects temporally consistent along the sequence.
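The abstract describes a linguistic branch that produces an attention map over the video frames. A minimal sketch of that idea (not the authors' implementation; the function names, feature shapes, and dot-product scoring are illustrative assumptions) is a language embedding scored against per-frame spatial features, normalized into one attention map per frame:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_maps(frame_feats, lang_emb):
    """Hypothetical linguistic-attention sketch.

    frame_feats: (T, H, W, D) visual features for T video frames.
    lang_emb: (D,) embedding of the linguistic description.
    Returns (T, H, W) attention maps, each summing to 1, so the same
    sentence embedding guides every frame of the sequence.
    """
    T, H, W, D = frame_feats.shape
    scores = frame_feats.reshape(T, H * W, D) @ lang_emb   # (T, H*W) similarities
    maps = np.stack([softmax(s) for s in scores])          # per-frame normalization
    return maps.reshape(T, H, W)

# Toy usage with random features.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8, 8, 16))   # 4 frames, 8x8 grid, 16-dim features
lang = rng.standard_normal(16)               # sentence embedding
att = attention_maps(feats, lang)
print(att.shape)                             # (4, 8, 8)
print(np.allclose(att.sum(axis=(1, 2)), 1))  # True: each map sums to 1
```

Because the same linguistic embedding scores every frame, the resulting maps are a simple way to bias segmentation toward the mentioned objects consistently across time.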
Herrera-Palacio, A., Ventura, C., & Giro-I-Nieto, X. (2019). Video object linguistic grounding. In MULEA 2019 - 1st International Workshop on Multimodal Understanding and Learning for Embodied Applications, co-located with MM 2019 (pp. 49–51). Association for Computing Machinery, Inc. https://doi.org/10.1145/3347450.3357662