Current research shows that detecting semantic concepts (e.g., animal, bus, person, dancing) in multimedia documents such as videos requires several types of complementary descriptors to achieve good results. In this work, we explore strategies for efficiently combining dozens of complementary content descriptors (or “experts”) through late fusion approaches for concept detection in multimedia documents. We explore two fusion approaches that share a common structure: both start with an expert-clustering stage, continue with an intra-cluster fusion and finish with an inter-cluster fusion; we also experiment with other state-of-the-art methods. The first fusion approach relies on a priori knowledge about the internals of each expert to group the available experts by similarity. The second approach automatically derives expert-similarity measures from the experts' outputs, groups the experts using agglomerative clustering, and then combines the results of this fusion with those from other methods. In the end, we show that an additional performance boost can be obtained by also considering the context of multimedia elements.
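The second approach described above can be illustrated with a minimal sketch: experts are represented by their score vectors over a common set of test samples, pairwise similarity is measured by Pearson correlation of outputs, similar experts are grouped by single-linkage agglomerative clustering, and fusion is a plain average within each cluster (intra-cluster) followed by an average across clusters (inter-cluster). The similarity measure, linkage, threshold, and averaging rule here are illustrative assumptions, not the exact choices of the chapter.

```python
def correlation(a, b):
    """Pearson correlation between two expert score vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb) if va and vb else 0.0

def agglomerate(experts, threshold=0.8):
    """Single-linkage agglomerative clustering of experts: repeatedly
    merge the two most similar clusters until no pair of clusters
    exceeds the (assumed) similarity threshold."""
    clusters = [[i] for i in range(len(experts))]

    def sim(c1, c2):  # single linkage: best pairwise similarity
        return max(correlation(experts[i], experts[j])
                   for i in c1 for j in c2)

    while len(clusters) > 1:
        a, b = max(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: sim(clusters[p[0]], clusters[p[1]]))
        if sim(clusters[a], clusters[b]) < threshold:
            break
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (a, b)] + [merged]
    return clusters

def hierarchical_fusion(experts, threshold=0.8):
    """Intra-cluster fusion (average within each cluster of experts),
    then inter-cluster fusion (average of the cluster-level scores)."""
    clusters = agglomerate(experts, threshold)
    n = len(experts[0])
    cluster_scores = [
        [sum(experts[i][k] for i in c) / len(c) for k in range(n)]
        for c in clusters
    ]
    return [sum(s[k] for s in cluster_scores) / len(cluster_scores)
            for k in range(n)]
```

For example, two near-identical experts and one dissimilar expert should form two clusters, so the dissimilar expert keeps the same weight in the final fusion as the redundant pair combined, which is the point of fusing hierarchically rather than averaging all experts directly.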
Strat, S. T., Benoit, A., Lambert, P., Bredin, H., & Quénot, G. (2014). Hierarchical late fusion for concept detection in videos. In Advances in Computer Vision and Pattern Recognition (Vol. 64, pp. 53–77). Springer London. https://doi.org/10.1007/978-3-319-05696-8_3