Abstract
In this paper, we address video instance segmentation with a new generative model that learns effective representations of target and background appearance. We propose to exploit hierarchical structural embedding over the spatio-temporal space, which is compact, powerful, and flexible compared with current tracking-by-detection methods. Specifically, our model segments and tracks instances across space and time in a single forward pass, formulated as hierarchical embedding learning; it is trained to locate the pixels belonging to specific instances over a video clip. First, we take advantage of a novel mixing function to better fuse spatio-temporal embeddings. Moreover, we introduce normalizing flows to further improve the robustness of the learned appearance embedding, theoretically extending conventional generative flows to a factorized conditional scheme. Comprehensive experiments on the video instance segmentation benchmark YouTube-VIS demonstrate the effectiveness of the proposed approach. Furthermore, we evaluate our method on an unsupervised video object segmentation dataset to demonstrate its generalizability.
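The abstract mentions extending generative (normalizing) flows to a conditional scheme. As a hedged illustration only — not the paper's actual code or architecture — the sketch below shows the generic building block such flows are typically composed of: an affine coupling layer whose scale and shift are predicted from part of the input together with a conditioning embedding (here standing in for an appearance embedding). All dimensions, names, and weights are placeholder assumptions.

```python
import numpy as np

# Illustrative conditional affine coupling layer (a standard normalizing-flow
# building block, NOT the paper's implementation). The layer transforms half
# of the input; scale/shift are predicted from the other half concatenated
# with a conditioning vector. Weights are random placeholders.

rng = np.random.default_rng(0)

def make_layer(dim, cond_dim, hidden=16):
    half = dim // 2
    w1 = rng.normal(0, 0.1, (half + cond_dim, hidden))
    w2 = rng.normal(0, 0.1, (hidden, 2 * half))
    return w1, w2

def coupling_forward(x, cond, layer):
    """Map x -> z; also return log|det J|, needed for the flow likelihood."""
    w1, w2 = layer
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    h = np.tanh(np.concatenate([x1, cond], axis=-1) @ w1) @ w2
    log_s, t = h[..., :half], h[..., half:]
    z2 = x2 * np.exp(log_s) + t          # affine transform of second half
    return np.concatenate([x1, z2], axis=-1), log_s.sum(axis=-1)

def coupling_inverse(z, cond, layer):
    """Exact inverse: recompute log_s, t from the untouched half."""
    w1, w2 = layer
    half = z.shape[-1] // 2
    z1, z2 = z[..., :half], z[..., half:]
    h = np.tanh(np.concatenate([z1, cond], axis=-1) @ w1) @ w2
    log_s, t = h[..., :half], h[..., half:]
    x2 = (z2 - t) * np.exp(-log_s)
    return np.concatenate([z1, x2], axis=-1)

layer = make_layer(dim=8, cond_dim=4)
x = rng.normal(size=(5, 8))       # batch of hypothetical pixel embeddings
cond = rng.normal(size=(5, 4))    # hypothetical per-instance conditioning
z, logdet = coupling_forward(x, cond, layer)
x_rec = coupling_inverse(z, cond, layer)
print(np.allclose(x, x_rec))      # invertibility gives a tractable density
```

The key property sketched here is exact invertibility with a cheap log-determinant, which is what makes flow-based likelihoods tractable; how the paper factorizes the conditioning is described in the full text, not in this sketch.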
Qin, Z., Lu, X., Nie, X., Zhen, X., & Yin, Y. (2021). Learning Hierarchical Embedding for Video Instance Segmentation. In MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia (pp. 1884–1892). Association for Computing Machinery, Inc. https://doi.org/10.1145/3474085.3475342