Introduction to semantic multimedia

Abstract

Recent progress in hardware and communication technologies has resulted in a rapid increase in the amount of multimedia information available to users. The usefulness of multimedia applications is largely determined by the accessibility of the content, so new challenges are emerging in terms of storing, transmitting, personalising, querying, indexing and retrieving the multimedia content. Some examples of such challenges include access by business users to multimedia content needed for their work, access by consumers to entertainment content in their home or when mobile, and sharing of content by both professional and private content owners. Clearly, a description and deeper understanding of the information at the semantic level is required (Chang 2002) in order to efficiently meet the requirements resulting from these challenges.

Attempts based on manual textual annotation, despite ensuring conceptual descriptions at a high level of abstraction, suffer from the subjectivity of the descriptions, thus creating a problem of interoperability, and are extremely expensive in terms of labour and time. Hence, given the volume of information to deal with, it is imperative that the process of extracting (i.e. analysing) such semantic descriptions (i.e. annotations) takes place in an automatic manner, or with minimum human intervention. At the same time, interoperable description schemes must be used, since customised and application-dependent metadata description schemes do not ensure interoperability and reusability. Automatic techniques exploiting textual information associated with multimedia content (e.g. in Web pages or captions in video) can provide a solution only when such textual information exists, and are limited by the relevance of the text, by the efficiency of the linguistic analysis tools and, as in the manual case, by the subjectivity and polysemy of the description. These limitations in terms of automation of the analysis and description processes motivate research into the direct exploitation of content-related (e.g. visual or audio) information.

Moving from low-level perceptual features to high-level semantic descriptions that match human cognition has become the final frontier in computer vision, and consequently in any multimedia application targeting efficient and effective access to, and manipulation of, the available content. The early efforts targeting this so-called semantic gap formed what are known as content-based (analysis and) retrieval approaches, where the focus is on extracting the most representative numerical descriptions and defining similarity metrics that emulate the human notion of similarity. Whilst low-level descriptors, metrics and segmentation tools are fundamental building blocks of any multimedia content manipulation technique, they evidently fail to fully capture, by themselves, the semantics of the audiovisual medium; achieving the latter is a prerequisite for reaching the desired level of efficiency in content manipulation and retrieval. The limitations of such numerical-based methodologies, however, led to the investigation of ways to enhance their performance. Iterative approaches such as relevance feedback (Rui, Huang, Ortega and Mehrotra 1998), which puts the user in the loop to "teach" the system what is required, and incremental learning (Naphade and Smith 2003), which uses rules to self-learn, are two common such enhancements.
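To make the content-based retrieval paradigm and the relevance-feedback loop mentioned above concrete, the following is a minimal sketch (assuming NumPy is available; the descriptor choice, the similarity metric, the parameter values and the random toy data are illustrative assumptions, not the chapter's method): a simple colour-histogram descriptor, a histogram-intersection similarity ranking, and a Rocchio-style query update in the spirit of Rui et al. (1998).

import numpy as np

def colour_histogram(image, bins=8):
    """Normalised joint RGB histogram of an H x W x 3 uint8 image (a simple low-level descriptor)."""
    hist, _ = np.histogramdd(image.reshape(-1, 3), bins=(bins,) * 3, range=[(0, 256)] * 3)
    hist = hist.flatten()
    return hist / hist.sum()

def rank_by_similarity(query_desc, db_descs):
    """Rank database descriptors by histogram intersection (higher score = more similar)."""
    scores = np.minimum(db_descs, query_desc).sum(axis=1)
    return np.argsort(-scores), scores

def rocchio_feedback(query_desc, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query towards descriptors the user marked relevant and away from non-relevant ones."""
    updated = alpha * query_desc
    if len(relevant):
        updated += beta * relevant.mean(axis=0)
    if len(non_relevant):
        updated -= gamma * non_relevant.mean(axis=0)
    return np.clip(updated, 0.0, None)

# Toy usage with random stand-in images.
rng = np.random.default_rng(0)
database = [rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8) for _ in range(20)]
db_descs = np.stack([colour_histogram(im) for im in database])
query_desc = colour_histogram(rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8))
ranking, _ = rank_by_similarity(query_desc, db_descs)
# One feedback round: top hits marked relevant, bottom hits marked non-relevant.
query_desc = rocchio_feedback(query_desc, db_descs[ranking[:3]], db_descs[ranking[-3:]])

Such a pipeline captures only perceptual similarity; as discussed next, it does not by itself bridge the semantic gap.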
However, the developed systems still could not meet realistic user needs, although some have proved particularly effective within certain application contexts, for example in Wu, Huang, Wang, Chiu and Chen (2007). As a result, research focus shifted to the exploitation of implicit and/or prior knowledge that could guide the process of analysis and semantics extraction. In other words, research efforts have concentrated on the semantic analysis of multimedia content, combining the aforementioned techniques with a priori domain-specific knowledge, so as to result in a high-level representation of multimedia content (Al-Khatib, Day, Ghafoor and Berra 1999). Domain-specific knowledge is utilised for guiding low-level feature extraction, high-level descriptor derivation and symbolic inference. Numerous approaches have been proposed building on this principle, exploiting various methods for modelling this knowledge, diverse representations and their consequent handling techniques; see, for example, Tovinkere and Qian (2001) for a description of a soccer event analysis method. Depending on the adopted knowledge acquisition and representation process, two types of approaches can be identified in the relevant literature: implicit, realised by machine learning methods, and explicit, realised by model-based approaches.

The usage of machine learning techniques has proved to be a robust methodology for discovering complex relationships and interdependencies between numerical image data and the perceptually higher-level concepts. Moreover, such techniques elegantly handle problems of high dimensionality. Among the most commonly adopted machine learning techniques are neural networks (NNs), hidden Markov models (HMMs), Bayesian networks (BNs), support vector machines (SVMs) and genetic algorithms (GAs) (Assfalg, Bertini, Del Bimbo, Nunziati and Pala 2005; Russell and Norvig 2003). On the other hand, model-based image analysis approaches make use of prior knowledge in the form of explicitly defined facts, models and rules, i.e. they provide a coherent semantic domain model to support "visual" inference in the specified context (Dasiopoulou, Mezaris, Kompatsiaris, Papastathis and Strintzis 2005; Hollink, Little and Hunter 2005). These facts, models and rules may connect semantic concepts with other concepts, or with low-level visual features.
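The distinction between implicit and explicit knowledge can be illustrated with a small sketch, assuming NumPy and scikit-learn are available; the "beach" concept, the synthetic descriptors and labels, and the hand-written rule are invented for illustration and are not taken from the cited works. An SVM learns a concept detector from labelled low-level descriptors, while an explicit domain rule infers a scene-level concept from region-level concepts.

import numpy as np
from sklearn.svm import SVC

# Implicit knowledge: an SVM learns the mapping from low-level descriptors
# to a concept label from labelled examples (labels here are synthetic).
rng = np.random.default_rng(1)
X_train = rng.random((200, 16))                               # 16-dim low-level descriptors
y_train = (X_train[:, 0] + X_train[:, 1] > 1.0).astype(int)   # stand-in "beach" labels
concept_detector = SVC(kernel="rbf", probability=True).fit(X_train, y_train)

def detect_concept(descriptor):
    """Probability that a region with this descriptor depicts the learned concept."""
    return float(concept_detector.predict_proba(descriptor.reshape(1, -1))[0, 1])

# Explicit knowledge: a hand-written domain rule connects region-level
# concepts to a scene-level concept.
def infer_scene(region_concepts):
    """Rule: regions labelled sand, sea and sky together imply a 'beach' scene."""
    return "beach" if {"sand", "sea", "sky"} <= set(region_concepts) else "unknown"

print(detect_concept(rng.random(16)))
print(infer_scene({"sand", "sea", "sky", "person"}))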
The application of semantics to multimedia is further motivated by the need to ensure that the content can be used for current applications as well as remaining applicable to future applications. Explicit object models, rules and facts tie the application to a representation of the world which may become invalid in the future, e.g. the presence of a typewriter being taken to indicate an office scene. Reference to standardised high-level semantics avoids this representation problem and, with appropriate abstraction, can support language independence. In addition to persistence of the relevance of the metadata, consistent metadata is needed to ensure that usable applications can be developed. For example, in content personalisation, where available multimedia content is selected for presentation to a user according to their preferences and interests, it is very important for the personalisation system to be able to accurately match the user's requirements. This implies that the content should be annotated such that there is no ambiguity, and its applicability can be determined by suitable automated processes. Clearly, standardised semantics are essential here, whereas informal, free-text annotations can lead to erroneous decisions.

Furthermore, in the real world, objects exist in a context. Representing context for multimedia applications is a research issue of great importance (Edmonds 1999), affecting the quality of the produced results, especially in the fields of multimedia retrieval, personalisation and analysis. The latter can be defined as a tightly coupled and constant interaction between low-level image analysis algorithms and high-level knowledge representation (Athanasiadis, Tzouvaras, Petridis, Precioso, Avrithis and Kompatsiaris 2005), an area where the role of context is crucial. In recent years, a number of different context aspects related to image analysis have been studied, and a number of different approaches to modelling context representation have been proposed (Zhao, Shimazu, Ohta, Hayasaka and Matsushita 1996).

As can be seen, there is a need for knowledge representation and processing in many multimedia applications, and indeed across the whole multimedia value chain (Fig. 1.1). This has led to an increasing convergence of research in the multimedia and knowledge domains, which we refer to as semantic multimedia. This knowledge may include components ranging from the subject matter of the discourse to more general data, such as information about imaging or the control strategy, and may be used in equally diverse ways depending on the intended application, such as personalised content summarisation, knowledge-assisted design, scientific modelling and semantics-based retrieval, as presented in the following chapters.
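Returning to the content personalisation scenario discussed above, the following deliberately simplified sketch ranks annotated content items against a weighted user preference profile; the concept vocabulary, item names and weights are hypothetical assumptions, and a real system would draw its concepts from a standardised, shared vocabulary as argued in the text.

# Annotated content items, with concepts assumed to come from a shared vocabulary.
content_items = {
    "news_clip_1": {"politics", "europe"},
    "match_clip_7": {"sport", "soccer", "goal"},
    "doc_clip_3": {"nature", "wildlife"},
}

# A user preference profile: concept -> interest weight.
user_profile = {"soccer": 1.0, "goal": 0.8, "wildlife": 0.5}

def personalisation_score(annotations, profile):
    """Sum the preference weights of the concepts annotating an item."""
    return sum(profile.get(concept, 0.0) for concept in annotations)

# Rank items by how well their annotations match the profile.
ranked = sorted(content_items.items(),
                key=lambda kv: personalisation_score(kv[1], user_profile),
                reverse=True)
for item, annotations in ranked:
    print(item, personalisation_score(annotations, user_profile))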

Citation
Kompatsiaris, Y., & Hobson, P. (2008). Introduction to semantic multimedia. Semantic Multimedia and Ontologies: Theory and Applications. Springer London. https://doi.org/10.1007/978-1-84800-076-6_1
