In this chapter, we address the open problem of meaningful object recognition in video. Approaches that estimate human visual attention and incorporate it into the overall visual content understanding process have recently become popular. Estimating visual attention in complex spatio-temporal content such as video requires fusing multiple information channels, including motion and spatial contrast. In the first part of the chapter, we study these questions and report on optimal strategies for bottom-up fusion in visual saliency estimation. The estimated visual saliency is then used to pool local descriptors. We compare different pooling approaches and show results on a challenging type of visual content: video recorded with wearable cameras for large-scale research on Alzheimer’s disease. The results, presented together with our conclusions, demonstrate that approaches based on saliency fusion outperform the best state-of-the-art techniques on this content.
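The two ideas summarized above, fusing bottom-up saliency channels and using the fused map to weight local descriptors during pooling, can be sketched as follows. This is a minimal illustration, not the chapter's exact method: the linear fusion weight `w_motion` and the weighted-average pooling scheme are assumptions made for clarity.

```python
import numpy as np

def fuse_saliency(motion_map, contrast_map, w_motion=0.5):
    """Linearly fuse two per-pixel saliency channels.
    (One simple bottom-up fusion strategy; the chapter compares several.)"""
    fused = w_motion * motion_map + (1.0 - w_motion) * contrast_map
    return fused / (fused.sum() + 1e-12)  # normalize to a distribution

def saliency_weighted_pool(descriptors, positions, saliency):
    """Pool local descriptors, weighting each one by the fused
    saliency value at its spatial location."""
    weights = np.array([saliency[y, x] for (y, x) in positions])
    weights /= weights.sum() + 1e-12
    return (weights[:, None] * descriptors).sum(axis=0)

# Toy example: a 4x4 frame with three 2-D local descriptors.
motion = np.random.rand(4, 4)     # stand-in for a motion saliency map
contrast = np.random.rand(4, 4)   # stand-in for a spatial-contrast map
sal = fuse_saliency(motion, contrast)
desc = np.random.rand(3, 2)       # three local descriptors
pos = [(0, 1), (2, 2), (3, 0)]    # their (row, col) positions
pooled = saliency_weighted_pool(desc, pos, sal)  # one frame-level vector
```

Descriptors falling in salient regions thus dominate the pooled representation, which is the intuition behind saliency-based pooling for recognition.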
Citation:
González-Díaz, I., Benois-Pineau, J., Buso, V., & Boujut, H. (2014). Fusion of multiple visual cues for object recognition in videos. In Advances in Computer Vision and Pattern Recognition (Vol. 64, pp. 79–107). Springer-Verlag London Ltd. https://doi.org/10.1007/978-3-319-05696-8_4