
TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-Class Object Recognition and Segmentation

by J Shotton, J Winn, C Rother, A Criminisi

Abstract

This paper proposes a new approach to learning a discriminative model of object classes, incorporating appearance, shape and context information efficiently. The learned model is used for automatic visual recognition and semantic segmentation of photographs. Our discriminative model exploits novel features, based on textons, which jointly model shape and texture. Unary classification and feature selection is achieved using shared boosting to give an efficient classifier which can be applied to a large number of classes. Accurate image segmentation is achieved by incorporating these classifiers in a conditional random field. Efficient training of the model on very large datasets is achieved by exploiting both random feature selection and piecewise training methods. High classification and segmentation accuracy are demonstrated on three different databases: i) our own 21-object class database of photographs of real objects viewed under general lighting conditions, poses and viewpoints, ii) the 7-class Corel subset and iii) the 7-class Sowerby database used in [1]. The proposed algorithm gives competitive results both for highly textured (e.g. grass, trees), highly structured (e.g. cars, faces, bikes, aeroplanes) and articulated objects (e.g. body, cow).


Available from www.springerlink.com

J. Shotton², J. Winn¹, C. Rother¹, and A. Criminisi¹
¹ Microsoft Research Ltd., Cambridge, UK, {jwinn,carrot,antcrim}@microsoft.com
² Department of Engineering, University of Cambridge, jdjs2@cam.ac.uk

1 Introduction

This paper investigates the problem of achieving automatic detection, recognition and segmentation of object classes in photographs. Precisely, given an image, the system should automatically partition it into semantically meaningful areas each labeled with a specific object class. The challenge is to handle a large number of both structured and unstructured object classes, while modeling their variabilities. Our focus is not only the accuracy of segmentation and recognition, but also the efficiency of the algorithm, which becomes particularly important when dealing with large image collections.

At a local level, the appearance of an image patch leads to ambiguities in its class label. For example, a window can be part of a car, a building or an aeroplane. To overcome these ambiguities, it is necessary to incorporate longer range information such as the spatial configuration of the patches on an object (the object shape) and also contextual information from the surrounding image. To achieve this we construct a discriminative model for labeling images which exploits all three types of information: appearance, shape and context.
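The abstract describes per-pixel (unary) classifiers that are incorporated into a conditional random field to obtain coherent segmentations. As a rough illustration of why a field over neighboring pixels can resolve locally ambiguous labels, the following is a minimal sketch of a grid labeling energy with a unary cost and a Potts smoothness penalty, minimized by iterated conditional modes (ICM). The Potts term, the weight `smooth_weight`, the function names and the use of ICM are all assumptions made for illustration; they are not the potentials or the inference procedure of the paper.

```python
# Illustrative sketch only. The Potts penalty, the ICM update and every name
# below are assumptions for illustration, not the paper's model or inference.

def neighbors(r, c, rows, cols):
    """Yield the 4-connected neighbors of pixel (r, c) inside the grid."""
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            yield nr, nc


def energy(labels, unary, smooth_weight):
    """Unary costs plus smooth_weight for every neighboring pair that disagrees."""
    rows, cols = len(labels), len(labels[0])
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            total += unary[r][c][labels[r][c]]
            # Count each edge once by looking only right and down.
            if c + 1 < cols and labels[r][c] != labels[r][c + 1]:
                total += smooth_weight
            if r + 1 < rows and labels[r][c] != labels[r + 1][c]:
                total += smooth_weight
    return total


def icm(unary, smooth_weight, max_sweeps=10):
    """Greedy coordinate-wise minimization, started from the unary-only labeling."""
    rows, cols = len(unary), len(unary[0])
    num_labels = len(unary[0][0])
    # Initial labeling: each pixel takes its cheapest unary label.
    labels = [
        [min(range(num_labels), key=lambda k: unary[r][c][k]) for c in range(cols)]
        for r in range(rows)
    ]
    for _ in range(max_sweeps):
        changed = False
        for r in range(rows):
            for c in range(cols):
                def local_cost(k):
                    # Cost of giving label k to (r, c) with neighbors held fixed.
                    cost = unary[r][c][k]
                    for nr, nc in neighbors(r, c, rows, cols):
                        if labels[nr][nc] != k:
                            cost += smooth_weight
                    return cost

                best = min(range(num_labels), key=local_cost)
                if best != labels[r][c]:
                    labels[r][c] = best
                    changed = True
        if not changed:
            break
    return labels


if __name__ == "__main__":
    # Hypothetical 3x3 image with two labels: every pixel prefers label 0
    # except the center, which weakly prefers label 1.
    unary = [[[0.2, 0.8] for _ in range(3)] for _ in range(3)]
    unary[1][1] = [0.6, 0.4]
    print(icm(unary, smooth_weight=0.0))  # unary-only labeling
    print(icm(unary, smooth_weight=0.5))  # labeling with smoothing
```

With `smooth_weight=0.0` the center pixel keeps its locally preferred label, while a positive weight lets the surrounding pixels override a weak local preference. ICM is used here only because it is short to write; it finds a local minimum and stands in for whatever inference a real system would use.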
Related work. Whilst the fields of object recognition and segmentation have been extremely active in recent years, many authors have considered these two tasks separately. For example, recognition of particular object classes has been achieved using the constellation models of Fergus et al. [2], the deformable shape models of Berg et al. [3] and the texture models of Winn et al. [4]. None of these methods leads to a pixel-wise segmentation of the image. Conversely, other authors have considered only the segmentation task, e.g. [5,6]. Joint detection and segmentation of a single object class has been achieved by several authors [7-9]. Typically, these approaches exploit a global shape model and are therefore unable to cope with arbitrary viewpoints or severe occlusion. Additionally, only highly structured object classes are addressed.

A task similar to the one addressed in this paper was considered in [10], where a classifier was used to label regions found by automatic segmentation. However, such segmentations often do not correlate with semantic objects. Our solution to this problem is to perform segmentation and recognition in the same unified framework rather than in two separate steps. Such a unified approach has been presented in [11], where only text and faces are recognized and at a high computational cost. Konishi and Yuille [12] label images using a unary classifier and hence do not achieve spatially coherent segmentations.

The most similar work to ours is that of He et al. [1], who incorporate region and global label features to model shape and context in a Conditional Random Field. Their work uses Gibbs sampling for both the parameter learning and label inference and is therefore limited in the size of dataset and number of classes which can be handled efficiently. Our focus on the speed of training and inference allows us to use larger datasets with many more object classes.
We currently handle 21 classes (compared to the seven classes of [1]) and it would be tractable to train our model on even larger datasets than presented here.

Our contributions in this paper are threefold. First, we present a discriminative model which is capable of fusing shape, appearance and context information to recognize efficiently the object classes present in an image, whilst exploiting edge information to provide an accurate segmentation. Second, we propose features, based on textons, which are capable of modeling object shape, appearance and context. Finally, we demonstrate how to train the model efficiently on a very large dataset by exploiting both boosting and piecewise training methods.

The paper is structured as follows. In the next section we describe the image database used in our experiments. Section 3 introduces the high-level model, a Conditional Random Field, while section 4 presents our novel low-level image features and their use in constructing a boosted classifier. Experiments, performance evaluation and conclusions are given in the final two sections.

2 Image Databases

Our object class models are learned from a set of labeled training images. In this paper we consider three different labeled image databases. Our own database³ is composed of 591 photographs of the following 21 object classes: building,

³ Publicly available at http://research.microsoft.com/vision/cambridge/recognition/.

