Challenges of Image and Video Retrieval

  • Lew M
  • Sebe N
  • Eakins J

Abstract

What use is the sum of human knowledge if nothing can be found? Although significant advances have been made in text searching, only preliminary work has been done in finding images and videos in large digital collections. In fact, if we examine the most frequently used image and video retrieval systems (e.g. www.google.com) we find that they are typically oriented around text searches where manual annotation was already performed. Image and video retrieval is a young field which has its genealogy rooted in artificial intelligence, digital signal processing, statistics, natural language understanding, databases, psychology, computer vision, and pattern recognition. However, none of these parental fields alone has been able to directly solve the retrieval problem. Indeed, image and video retrieval lies at the intersections and crossroads between the parental fields. It is these curious intersections which appear to be the most promising.
What are the main challenges in image and video retrieval? We think the paramount challenge is bridging the semantic gap. By this we mean that low-level features are easily measured and computed, but the starting point of the retrieval process is typically the high-level query from a human. Translating or converting the question posed by a human to the low-level features seen by the computer illustrates the problem in bridging the semantic gap. However, the semantic gap is not merely translating high-level features to low-level features. The essence of a semantic query is understanding the meaning behind the query. This can involve understanding both the intellectual and emotional sides of the human, not merely the distilled logical portion of the query but also the personal preferences and emotional subtones of the query and the preferential form of the results. In these proceedings, several papers [1][2][3][4][5][6][7][8] touch upon the semantic problem and give valuable insights into the current state of the art.
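To make the contrast between low-level features and a high-level query concrete, the following is a minimal sketch and is not taken from any of the cited papers. It assumes a coarse RGB color histogram as the low-level feature and histogram intersection as the similarity measure; the function names and the toy pixel data are hypothetical.

```python
from collections import Counter

def color_histogram(pixels, bins_per_channel=4):
    """Quantize RGB pixels (0-255 per channel) into a normalized histogram."""
    if not pixels:
        raise ValueError("pixels must not be empty")
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {bin_id: n / total for bin_id, n in counts.items()}

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: sum of bin-wise minima of two normalized histograms."""
    return sum(min(v, h2.get(k, 0.0)) for k, v in h1.items())

# Toy "images" as flat lists of RGB tuples (hypothetical data).
sunset = [(250, 120, 30)] * 6 + [(40, 20, 60)] * 2   # mostly orange, some dark sky
orange_wall = [(240, 110, 20)] * 8                    # uniformly orange
night_sky = [(10, 10, 50)] * 8                        # uniformly dark

query = color_histogram(sunset)
score_wall = histogram_intersection(query, color_histogram(orange_wall))
score_sky = histogram_intersection(query, color_histogram(night_sky))
print(score_wall, score_sky)
```

Under these assumptions the orange wall scores higher against the sunset query than the night sky does, because the feature measures color distribution only. Nothing in the histogram encodes the concept "sunset", which is the kind of mismatch the semantic gap describes.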
Wang et al [1] propose the use of color-texture classification to generate a codebook which is used to segment images into regions. The content of a region is then characterized by its self-saliency, which describes its perceptual importance. Bruijn and Lew [2] investigate multi-modal content-based browsing and searching methods for Peer2Peer retrieval systems. Their work targets the assumption that keyframes are more interesting when they contain people. Vendrig and Worring [3] propose a system that allows character identification in movies. In order to achieve this, they relate visual content to names extracted from movie scripts. Denman et al [5] present the tools in a system for creating semantically meaningful summaries of broadcast Snooker footage. Their system parses the video sequence, identifies relevant camera views, and tracks ball movements. A similar approach presented by Kim et al [8] extracts semantic information from basketball videos based on audio-visual features. A semantic video retrieval approach using audio analysis is presented by Bakker and Lew [7] in which the audio can be automatically categorized into semantic categories such as explosions, music, speech, etc. A system for recognizing objects in video sequences is presented by Visser et al [6]. They use the Kalman filter to obtain segmented blobs from the video, classify the blobs using the probability ratio test, and apply several different temporal methods, which results in sequential classification methods over the video sequence containing the blob. An automated scene matching algorithm is presented by Schaffalitzky and Zisserman [4]. Their goal is to match images of the same 3D scene in a movie. Ruiz-del-Solar and Navarrete [9] present a content-based face retrieval system that uses self-organizing maps (SOMs) and user feedback. SOMs were also employed by Oh et al [10], Hussain et al [11], and Huang et al [12] for visual clustering.
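The idea of classifying a tracked blob sequentially over a video, as mentioned for Visser et al [6], can be illustrated with a generic sequential probability ratio test (Wald's SPRT). This is only a sketch of the general technique and is not claimed to be the method of [6]: the single scalar feature per frame, the Gaussian class models, the error rates, and all names below are assumptions made for illustration.

```python
import math

def gaussian_log_pdf(x, mean, std):
    """Log density of a 1-D Gaussian at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def sequential_ratio_test(observations, class_a, class_b, alpha=0.05, beta=0.05):
    """Accumulate the log-likelihood ratio frame by frame (Wald's SPRT).

    class_a and class_b are hypothetical (mean, std) Gaussian models.
    Returns (label, frames_used); label is "undecided" if neither
    threshold is crossed before the observations run out.
    """
    upper = math.log((1 - beta) / alpha)   # decide class "a" at or above this
    lower = math.log(beta / (1 - alpha))   # decide class "b" at or below this
    llr = 0.0
    for n, x in enumerate(observations, start=1):
        llr += gaussian_log_pdf(x, *class_a) - gaussian_log_pdf(x, *class_b)
        if llr >= upper:
            return "a", n
        if llr <= lower:
            return "b", n
    return "undecided", len(observations)

# One hypothetical feature value per frame for a single tracked blob.
frames = [0.9, 1.1, 0.8, 0.7, 1.0]
result = sequential_ratio_test(frames, class_a=(1.0, 0.5), class_b=(1.5, 0.5))
print(result)
```

With these assumed models no single frame is decisive, but the accumulated evidence crosses the upper threshold at the fourth frame. That is the appeal of a sequential test for video: the decision is deferred until enough frames agree.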
A ranking algorithm using dynamic clustering for content-based image retrieval is proposed by Park et al [13]. A learning method using the AdaBoost algorithm and a k-nearest neighbor approach is proposed by Pickering et al [14] for video retrieval. An overview of challenges for content-based navigation of digital video is presented by Smeaton [15]. The author presents the different ways in which video content can be used directly to support the navigation within large video libraries and lists the challenges that still remain to be addressed in this area. An insight into the problems and challenges of retrieval of archival moving imagery via the Internet is presented by Enser and Sandom [16]. The authors conclude that the combination of limited CBIR functionality and lack of adherence to cataloging standards seriously limits the Internet's potential for providing enhanced access to film and video-based cultural resources. Burke [17] describes a research project which applies Personal Construct Theory to individual user perceptions of photographs. This work presents a librarian viewpoint toward content-based image retrieval. A user-centric system for visualization and layout for content-based image retrieval and browsing is proposed by Tian et al [18].

Cite

CITATION STYLE

APA

Lew, M. S., Sebe, N., & Eakins, J. P. (2002). Challenges of Image and Video Retrieval (pp. 1–6). https://doi.org/10.1007/3-540-45479-9_1
