Given a video and associated text, we propose an automatic annotation scheme in which we employ a latent topic model to generate topic distributions from weighted text and then modify these distributions based on visual similarity. We apply this scheme to location annotation of a television series for which transcripts are available. The topic distributions allow us to avoid explicit classification, which is useful when the exact number of locations is unknown. Moreover, many locations are unique to a single episode, making it impossible to obtain representative training data for a supervised approach. Our method first segments the episode into scenes by fusing cues from both images and text. We then assign location-oriented weights to the text and generate topic distributions for each scene using Latent Dirichlet Allocation. Finally, we update the topic distributions using the distributions of visually similar scenes, where visual similarity between scenes is formulated as an Earth Mover's Distance problem. We quantitatively validate our multi-modal approach to segmentation and qualitatively evaluate the resulting location annotations. Our results demonstrate that we are able to generate accurate annotations, even for locations seen in only a single episode.
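To make the pipeline concrete, the sketch below illustrates the three core steps described in the abstract: fitting LDA on (already weighted and tokenised) per-scene text, measuring visual similarity between scenes with an Earth Mover's Distance over keyframe-descriptor signatures, and blending each scene's topic distribution with those of its nearest visual neighbours. It is a minimal sketch, not the authors' implementation: it assumes the gensim library for LDA and the POT (Python Optimal Transport) library for EMD, and the function names, neighbour count, and mixing weight alpha are illustrative choices rather than the weighting scheme or update rule used in the paper.

import numpy as np
import ot                                   # POT: Python Optimal Transport
from gensim import corpora, models

def scene_topic_distributions(scene_tokens, num_topics=20):
    """Fit LDA on the (location-weighted) scene texts and return one
    topic distribution per scene as a dense array of shape (scenes, topics)."""
    dictionary = corpora.Dictionary(scene_tokens)
    bows = [dictionary.doc2bow(toks) for toks in scene_tokens]
    lda = models.LdaModel(bows, num_topics=num_topics, id2word=dictionary)
    theta = np.zeros((len(bows), num_topics))
    for i, bow in enumerate(bows):
        for k, p in lda.get_document_topics(bow, minimum_probability=0.0):
            theta[i, k] = p
    return theta

def emd_scene_distance(feats_a, w_a, feats_b, w_b):
    """EMD between two scenes' descriptor signatures.
    feats_*: (n_i, d) descriptor matrices; w_*: nonnegative weights summing to 1."""
    cost = ot.dist(feats_a, feats_b)        # pairwise (squared Euclidean) ground costs
    return ot.emd2(w_a, w_b, cost)          # optimal-transport cost = EMD

def refine_with_visual_neighbours(theta, scene_feats, scene_weights,
                                  n_neighbours=3, alpha=0.5):
    """Blend each scene's topic distribution with the mean distribution of its
    visually most similar scenes; alpha is an assumed mixing weight."""
    n_scenes = len(theta)
    dist = np.zeros((n_scenes, n_scenes))
    for i in range(n_scenes):
        for j in range(i + 1, n_scenes):
            d = emd_scene_distance(scene_feats[i], scene_weights[i],
                                   scene_feats[j], scene_weights[j])
            dist[i, j] = dist[j, i] = d
    refined = theta.copy()
    for i in range(n_scenes):
        neighbours = np.argsort(dist[i])[1:n_neighbours + 1]   # skip the scene itself
        refined[i] = alpha * theta[i] + (1 - alpha) * theta[neighbours].mean(axis=0)
        refined[i] /= refined[i].sum()                          # renormalise
    return refined

In this sketch the refined distribution for each scene remains a proper probability distribution over topics, so it can still be used directly for ranking candidate location labels without an explicit classifier, in the spirit of the approach described above.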
Engels, C., Deschacht, K., Becker, J. H., Tuytelaars, T., Moens, M. F., & Van Gool, L. (2010). Automatic annotation of unique locations from video and text. In British Machine Vision Conference, BMVC 2010 - Proceedings. British Machine Vision Association, BMVA. https://doi.org/10.5244/C.24.115