Sven Behnke
Hierarchical Neural Networks for Image Interpretation
June 13, 2003
Draft submitted to Springer-Verlag
Published as volume 2766 of Lecture Notes in Computer Science
ISBN: 3-540-40722-7
Foreword

It is my pleasure and privilege to write the foreword for this book, whose results I have been following and awaiting for the last few years. This monograph represents the outcome of an ambitious project oriented towards advancing our knowledge of the way the human visual system processes images, and of the way it combines high-level hypotheses with low-level inputs during pattern recognition. The model proposed by Sven Behnke, carefully laid out in the following pages, can now be applied by other researchers to practical problems in the field of computer vision and also provides clues for reaching a deeper understanding of the human visual system.

This book arose out of dissatisfaction with an earlier project: back in 1996, Sven wrote one of the handwritten digit recognizers for the mail sorting machines of the Deutsche Post AG. The project was successful because the machines could indeed recognize the handwritten ZIP codes, at a rate of several thousand letters per hour. However, Sven was not satisfied with the amount of expert knowledge that was needed to develop the feature extraction and classification algorithms. He wondered whether the computer could extract meaningful features by itself and use these for classification. His experience in the project told him that forward computation alone would be incapable of improving the results already obtained. From his knowledge of the human visual system, he postulated that only a two-way system could work, one that could advance a hypothesis by focussing the attention of the lower layers of a neural network on it. He spent the next few years developing a new model for tackling precisely this problem.

The main result of this book is the proposal of a generic architecture for pattern recognition problems, called Neural Abstraction Pyramid (NAP). The architecture is layered, pyramidal, competitive, and recurrent. It is layered because images are represented at multiple levels of abstraction.
It is recurrent because backward projections connect the upper to the lower layers. It is pyramidal because the resolution of the representations is reduced from one layer to the next. It is competitive because in each layer units compete against each other, trying to classify the input best. The main idea behind this architecture is letting the lower layers interact with the higher layers. The lower layers send some simple features to the upper layers; the upper layers recognize more complex features and bias the computation in the lower layers. This in turn improves the input to the upper layers, which can refine their hypotheses, and so on. After a few iterations the network settles on the best interpretation. The architecture can be trained in supervised and unsupervised mode.
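The pyramidal property described above can be made concrete with a small sketch. It is not taken from the book: the function name `pyramid_shapes`, the halving of the resolution, and the doubling of the number of feature maps per layer are assumptions chosen only for illustration. The text itself states only that resolution is reduced from one layer to the next.

```python
def pyramid_shapes(height, width, base_features, num_layers):
    """Return (height, width, feature_maps) for each layer of a toy pyramid.

    Assumption (hypothetical, for illustration only): every step up the
    pyramid halves the spatial resolution and doubles the number of
    feature maps.
    """
    shapes = []
    h, w, f = height, width, base_features
    for _ in range(num_layers):
        shapes.append((h, w, f))
        # Reduce the resolution, but never below a single cell.
        h, w, f = max(1, h // 2), max(1, w // 2), f * 2
    return shapes


if __name__ == "__main__":
    for level, (h, w, f) in enumerate(pyramid_shapes(32, 32, 4, 4)):
        print(f"layer {level}: {h}x{w} cells, {f} feature maps")
```

Under these assumed factors, the number of cells per layer shrinks by a factor of four at each step while the number of feature maps only doubles, so the higher layers hold fewer but more diverse values.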
Here, I should mention that there have been many proposals of recurrent architectures for pattern recognition. Over the years we have tried to apply them to non-trivial problems. Unfortunately, many of the proposals advanced in the literature break down when confronted with non-toy problems. Therefore, one of the first advantages present in Behnke's architecture is that it actually works, even when the problem is difficult and really interesting for commercial applications.

The structure of the book reflects the road taken by Sven to tackle the problem of combining top-down processing of hypotheses with bottom-up processing of images. Part I describes the theory and Part II the applications of the architecture. The first two chapters motivate the problem to be investigated and identify the features of the human visual system which are relevant for the proposed architecture: retinotopic organization of feature maps, local recurrence with excitation and inhibition, hierarchy of representations, and adaptation through learning.

Chapter 3 gives an overview of several models proposed in recent years and provides a gentle introduction to the next chapter, which describes the NAP architecture. Chapter 5 deals with a special case of the NAP architecture, when only forward projections are used and features are learned in an unsupervised way. With this chapter, Sven came full circle: the digit classification task he had solved for mail sorting, using a hand-designed structural classifier, was now outperformed by an automatically trained system. This is a remarkable result, since much expert knowledge went into the design of the hand-crafted system.

Four applications of the NAP constitute Part II.
The first application is the recognition of meter values (printed postage stamps), the second the binarization of matrix codes (also used for postage), the third is the reconstruction of damaged images, and the last is the localization of faces in complex scenes. The image reconstruction problem is my favorite regarding the kind of tasks solved. A complete NAP is used, with all its lateral, feed-forward and backward connections. In order to infer the original images from degraded ones, the network must learn models of the objects present in the images and combine them with models of typical degradations.

I think that it is interesting how this book started from a general inspiration about the way the human visual system works, how then Sven extracted some general principles underlying visual perception and how he applied them to the solution of several vision problems. The NAP architecture is what the Neocognitron (a layered model proposed by Fukushima in the 1980s) aspired to be. It is the Neocognitron gotten right. The main difference between one and the other is the recursive nature of the NAP. Combining the bottom-up with the top-down approach allows for iterative interpretation of ambiguous stimuli.

I can only encourage the reader to work his or her way through this book. It is very well written and provides solutions for some technical problems as well as inspiration for neurobiologists interested in common computational principles in human and computer vision. The book is like a road that will lead the attentive reader to a rich landscape, full of new research opportunities.

Berlin, June 2003
Raúl Rojas
Preface

This thesis is published in partial fulfillment of the requirements for the degree of "Doktor der Naturwissenschaften" (Dr. rer. nat.) at the Department of Mathematics and Computer Science of Freie Universität Berlin. Prof. Dr. Raúl Rojas (FU Berlin) and Prof. Dr. Volker Sperschneider (Osnabrück) acted as referees. The thesis was defended on November 27, 2002.

Summary of the Thesis

Human performance in visual perception by far exceeds the performance of contemporary computer vision systems. While humans are able to perceive their environment almost instantly and reliably under a wide range of conditions, computer vision systems work well only under controlled conditions in limited domains.

This thesis addresses the differences in data structures and algorithms underlying the differences in performance. The interface problem between symbolic data manipulated in high-level vision and signals processed by low-level operations is identified as one of the major issues of today's computer vision systems.

This thesis aims at reproducing the robustness and speed of human perception by proposing a hierarchical architecture for iterative image interpretation. I propose to use hierarchical neural networks for representing images at multiple abstraction levels. The lowest level represents the image signal. As one ascends these levels of abstraction, the spatial resolution of two-dimensional feature maps decreases while feature diversity and invariance increase. The representations are obtained using simple processing elements that interact locally. Recurrent horizontal and vertical interactions are mediated by weighted links. Weight sharing keeps the number of free parameters low. Recurrence makes it possible to integrate bottom-up, lateral, and top-down influences.

Image interpretation in the proposed architecture is performed iteratively. An image is interpreted first at positions where little ambiguity exists.
Partial results then bias the interpretation of more ambiguous stimuli. This is a flexible way to incorporate context. Such a refinement is most useful when the image contrast is low, noise and distractors are present, objects are partially occluded, or the interpretation is otherwise complicated.

The proposed architecture can be trained using unsupervised and supervised learning techniques. This makes it possible to replace the manual design of application-specific
computer vision systems with the automatic adaptation of a generic network. The task to be solved is then described using a dataset of input/output examples.

Applications of the proposed architecture are illustrated using small networks. Furthermore, several larger networks were trained to perform non-trivial computer vision tasks, such as the recognition of the value of postage meter marks and the binarization of matrix codes. It is shown that image reconstruction problems, such as super-resolution, filling-in of occlusions, and contrast enhancement/noise removal, can be learned as well. Finally, the architecture was applied successfully to localize faces in complex office scenes. The network is also able to track moving faces.

Acknowledgements

My profound gratitude goes to Professor Raúl Rojas, my mentor and research advisor, for guidance, contribution of ideas, and encouragement. I salute Raúl's genuine passion for science, discovery and understanding, superior mentoring skills, and unparalleled availability.

The research for this thesis was done at the Computer Science Institute of the Freie Universität Berlin. I am grateful for the opportunity to work in such a stimulating environment, embedded in the exciting research context of Berlin. The AI group has been host to many challenging projects, e.g. to the RoboCup FU-Fighters project and to the E-Chalk project. I owe a great deal to the members and former members of the group. In particular, I would like to thank Alexander Gloye, Bernhard Frötschl, Jan Dösselmann, and Dr. Marcus Pfister for helpful discussions.

Parts of the applications were developed in close cooperation with Siemens ElectroCom Postautomation GmbH. Testing the performance of the proposed approach on real-world data was invaluable to me.
I am indebted to Torsten Lange, who was always open to unconventional ideas and gave me detailed feedback, and to Katja Jakel, who prepared the databases and did the evaluation of the experiments.

My gratitude goes also to the people who helped me to prepare the manuscript of the thesis. Dr. Natalie Hempel de Ibarra made sure that the chapter on the neurobiological background reflects current knowledge. Gerald Friedland, Mark Simon, Alexander Gloye, and Mary Ann Brennan helped by proofreading parts of the manuscript. Special thanks go to Barry Chen who helped me to prepare the thesis for publication.

Finally, I wish to thank my family for their support. My parents have always encouraged and guided me to independence, never trying to limit my aspirations. Most importantly, I thank Anne, my wife, for showing untiring patience and moral support, reminding me of my priorities and keeping things in perspective.

Berkeley, June 2003
Sven Behnke
Table of Contents

Foreword
Preface

1. Introduction
   1.1 Motivation
       1.1.1 Importance of Visual Perception
       1.1.2 Performance of the Human Visual System
       1.1.3 Limitations of Current Computer Vision Systems
       1.1.4 Iterative Interpretation – Local Interactions in a Hierarchy
   1.2 Organization of the Thesis
   1.3 Contributions

Part I. Theory

2. Neurobiological Background
   2.1 Visual Pathways
   2.2 Feature Maps
   2.3 Layers
   2.4 Neurons
   2.5 Synapses
   2.6 Discussion
   2.7 Conclusions

3. Related Work
   3.1 Hierarchical Image Models
       3.1.1 Generic Signal Decompositions
       3.1.2 Neural Networks
       3.1.3 Generative Statistical Models
   3.2 Recurrent Models
       3.2.1 Models with Lateral Interactions
       3.2.2 Models with Vertical Feedback
       3.2.3 Models with Lateral and Vertical Feedback
   3.3 Conclusions

4. Neural Abstraction Pyramid Architecture
   4.1 Overview
       4.1.1 Hierarchical Network Structure
       4.1.2 Distributed Representations
       4.1.3 Local Recurrent Connectivity
       4.1.4 Iterative Refinement
   4.2 Formal Description
       4.2.1 Simple Processing Elements
       4.2.2 Shared Weights
       4.2.3 Discrete-Time Computation
       4.2.4 Various Transfer Functions
   4.3 Example Networks
       4.3.1 Local Contrast Normalization
       4.3.2 Binarization of Handwriting
       4.3.3 Activity-Driven Update
       4.3.4 Invariant Feature Extraction
   4.4 Conclusions

5. Unsupervised Learning
   5.1 Introduction
   5.2 Learning a Hierarchy of Sparse Features
       5.2.1 Network Architecture
       5.2.2 Initialization
       5.2.3 Hebbian Weight Update
       5.2.4 Competition
   5.3 Learning Hierarchical Digit Features
   5.4 Digit Classification
   5.5 Discussion

6. Supervised Learning
   6.1 Introduction
       6.1.1 Nearest Neighbor Classifier
       6.1.2 Decision Trees
       6.1.3 Bayesian Classifier
       6.1.4 Support Vector Machines
       6.1.5 Bias/Variance Dilemma
   6.2 Feed-Forward Neural Networks
       6.2.1 Error Backpropagation
       6.2.2 Improvements to Backpropagation
       6.2.3 Regularization
   6.3 Recurrent Neural Networks
       6.3.1 Backpropagation Through Time
       6.3.2 Real-Time Recurrent Learning
       6.3.3 Difficulty of Learning Long-Term Dependencies
       6.3.4 Random Recurrent Networks with Fading Memories
       6.3.5 Robust Gradient Descent
   6.4 Conclusions

Part II. Applications

7. Recognition of Meter Values
   7.1 Introduction to Meter Value Recognition
   7.2 Swedish Post Database
   7.3 Preprocessing
       7.3.1 Filtering
       7.3.2 Normalization
   7.4 Block Classification
       7.4.1 Network Architecture and Training
       7.4.2 Experimental Results
   7.5 Digit Recognition
       7.5.1 Digit Preprocessing
       7.5.2 Digit Classification
       7.5.3 Combination with Block Recognition
   7.6 Conclusions

8. Binarization of Matrix Codes
   8.1 Introduction to Two-Dimensional Codes
   8.2 Canada Post Database
   8.3 Adaptive Threshold Binarization
   8.4 Image Degradation
   8.5 Learning Binarization
   8.6 Experimental Results
   8.7 Conclusions

9. Learning Iterative Image Reconstruction
   9.1 Introduction to Image Reconstruction
   9.2 Super-Resolution
       9.2.1 NIST Digits Dataset
       9.2.2 Architecture for Super-Resolution
       9.2.3 Experimental Results
   9.3 Filling-in Occlusions
       9.3.1 MNIST Dataset
       9.3.2 Architecture for Filling-In of Occlusions
       9.3.3 Experimental Results
   9.4 Noise Removal and Contrast Enhancement
       9.4.1 Image Degradation
       9.4.2 Experimental Results
   9.5 Reconstruction from a Sequence of Degraded Digits
       9.5.1 Image Degradation
       9.5.2 Experimental Results
   9.6 Conclusions

10. Face Localization
    10.1 Introduction to Face Localization
    10.2 Face Database and Preprocessing
    10.3 Network Architecture
    10.4 Experimental Results
    10.5 Conclusions

11. Summary and Conclusions
    11.1 Short Summary of Contributions
    11.2 Conclusions
    11.3 Future Work
        11.3.1 Implementation Options
        11.3.2 Using more Complex Processing Elements
        11.3.3 Integration into Complete Systems
1. Introduction

1.1 Motivation

1.1.1 Importance of Visual Perception

Visual perception is important for both humans and computers. Humans are visual animals. To appreciate its importance, just imagine how losing your sight would affect you. We extract most information about the world around us by seeing.

This is possible because photons sensed by the eyes carry information about the world. On their way from light sources to the photoreceptors they interact with objects and get altered by this process. For instance, the wavelength of a photon may reveal information about the color of a surface it was reflected from. Sudden changes in the intensity of light along a line may indicate the edge of an object. By analyzing intensity gradients, the curvature of a surface may be recovered. Texture or the type of reflection can be used to further characterize surfaces. The change of visual stimuli over time is an important source of information as well. Motion may indicate the change of an object's pose or reflect ego-motion. Synchronous motion is a strong hint for segmentation, the grouping of visual stimuli into objects, because parts of the same object tend to move together.

Vision allows us to sense over long distances since light travels through the air without significant loss. It is non-destructive and, if no additional lighting is used, it is also passive. This allows for perception without being noticed.

Since we have a powerful visual system, we designed our environment to provide visual cues. Examples include marked lanes on the roads and traffic lights. Our interaction with computers is based on visual information as well. Large screens display the data we manipulate and printers produce documents for later visual perception.

Powerful computer graphics systems have been developed to feed our visual system. Today's computers include special-purpose processors for rendering images. They produce almost realistic perceptions of simulated environments.
On the other hand, the communication channel from the users to computers has a very low bandwidth. It consists mainly of the keyboard and a pointing device. More natural interaction with computers requires advanced interfaces, including computer vision components. Recognizing the user and perceiving his or her actions are key prerequisites for more intelligent user interfaces.
Computer vision, that is, the extraction of information from images and image sequences, is also important for applications other than human-computer interaction. For instance, it can be used by robots to extract information from their environment. Just as visual perception is crucial for us, it is crucial for autonomous mobile robots acting in a world designed for us. A driver assistance system in a car, for example, must perceive all the signs and markings on the road, as well as other cars, pedestrians, and many more objects.

Computer vision techniques are also used for the analysis of static images. In medical imaging, for example, they can be used to aid the interpretation of images by a physician. Another application area is the automatic interpretation of satellite images. One particularly successful application of computer vision techniques is the reading of documents. Machines for check reading and mail sorting are widely used.

1.1.2 Performance of the Human Visual System

Human performance for visual tasks is impressive. The human visual system perceives stimuli of a high dynamic range. It works well in the brightest sunlight and still allows for orientation under limited lighting conditions, e.g. at night. It has been shown that we can even perceive single photons.

Under normal lighting, the system has high acuity. We are able to perceive object details and can recognize far-away objects. Humans can also perceive color. When colors are presented next to each other, we can distinguish thousands of color nuances.

The visual system manages to separate objects from other objects and the background. We are also able to separate object-motion from ego-motion. This facilitates the detection of change in the environment.

One of the most remarkable features of the human visual system is its ability to recognize objects under transformations. Moderate changes in illumination, object pose, and size do not affect perception.
Another invariance produced by the visual system is color constancy. By accounting for illumination changes, we perceive different wavelength mixtures as the same color. This inference process recovers the reflectance properties of surfaces, the object color. We are also able to tolerate deformations of non-rigid objects. Object categorization is another valuable property. If we have seen several examples of a category, say dogs, we can easily classify an unseen animal as a dog if it has the typical dog features.

The human visual system is strongest for the stimuli that are most important to us: faces, for instance. We are able to distinguish thousands of different faces. On the other hand, we can recognize a person although he or she has aged, changed hair style and now wears glasses.

Human visual perception is not only remarkably robust to variances and noise, but it is fast as well. We need only about 100 ms to extract the basic gist of a scene, we can detect targets in naturalistic scenes in 150 ms, and we are able to understand complicated scenes within 400 ms.

Visual processing is mostly done subconsciously. We do not perceive the difficulties involved in the task of interpreting natural stimuli. This does not mean that this task is easy. The challenge originates in the fact that visual stimuli are frequently
ambiguous. Inferring three-dimensional structure from two-dimensional images, for example, is inherently ambiguous. Many 3D objects correspond to the same image. The visual system must rely on various depth cues to infer the third dimension. Another example is the interpretation of spatial changes in intensity. Among their potential causes are changes in the reflectance of an object's surface (e.g. texture), inhomogeneous illumination (e.g. at the edge of a shadow) and the discontinuity of the reflecting surface at the object borders.

Occlusions are a frequent source of ambiguity as well. Our visual system must guess what occluded object parts look like. This is illustrated in Figure 1.1. We are able to recognize the letters "B", which are partially occluded by a black line. If the occluding line is white, the interpretation is much more challenging, because the occlusion is not detected and the "guessing mode" is not employed.

Fig. 1.1. Role of occluding region in recognition of occluded letters: (a) letters "B" partially occluded by a black line; (b) same situation, but the occluding line is white (it merges with the background; recognition is much more difficult) (image from [164]).

Fig. 1.2. Light-from-above assumption: (a) stimuli in the middle column are perceived as concave surfaces whereas stimuli on the sides appear to be convex; (b) rotation by 180° makes convex stimuli concave and vice versa.

Since the task of interpreting ambiguous stimuli is not well-posed, prior knowledge must be used for visual inference. The human visual system uses many heuristics to resolve ambiguities. One of the assumptions the system relies on is that light comes from above. Figure 1.2 illustrates this fact.
Since the curvature of surfaces can be inferred from shading only up to the ambiguity of a convex or a concave interpretation, the visual system prefers the interpretation that is consistent with a light source located above the object. This choice is correct most of the time.
Fig. 1.3. Gestalt principles of perception [125]: (a) similar stimuli are grouped together; (b) proximity is another cue for grouping; (c) line segments are grouped based on good continuation; (d) symmetric contours form objects; (e) closed contours are more salient than open ones; (f) connectedness and belonging to a common region cause grouping.

Fig. 1.4. Kanizsa figures [118]. Four inducers produce the percept of a white square partially occluding four black disks. Line endings induce illusory contours perpendicular to the lines. The square can be bent if the opening angles of the arcs are slightly changed.

Other heuristics are summarized by the Gestalt principles of perception [125]. Some of them are illustrated in Figure 1.3. Gestalt psychology emphasizes the Prägnanz of perception: stimuli group spontaneously into the simplest possible configuration. Examples include the grouping of similar stimuli (see Part (a)). Proximity is another cue for grouping (b). Line segments are connected based on good continuation (c). Symmetric or parallel contours indicate that they belong to the same object (d). Closed contours are more salient than open ones (e). Connectedness and belonging to a common region cause grouping as well (f). Last, but not least, common fate (synchrony in motion) is a strong hint that stimuli belong to the same object.

Although such heuristics are correct most of the time, sometimes they fail. This results in unexpected perceptions, called visual illusions. Kanizsa figures [118], shown in Figure 1.4, are one example of these illusions. In the left part of the figure, four inducers produce the percept of a white square in front of black disks, because
this interpretation is the simplest one. Illusory contours are perceived between the inducers, although there is no intensity change. The middle of the figure shows that virtual contours are also induced at line endings perpendicular to the lines because occlusions are likely causes of line endings. In the right part of the figure it is shown that one can even bend the square, if the opening angles of the arc segments are slightly changed.

Fig. 1.5. Visual illusions: (a) Müller-Lyer illusion [163] (the vertical lines appear to have different lengths); (b) horizontal-vertical illusion (the vertical line appears to be longer than the horizontal one); (c) Ebbinghaus-Titchener illusion (the central circles appear to have different sizes).

Fig. 1.6. Munker-White illusion [224] illustrates contextual effects of brightness perception: (a) both diagonals have the same brightness; (b) same situation without occlusion.

Three more visual illusions are shown in Figure 1.5. In the Müller-Lyer illusion [163] (Part (a)), two vertical lines appear to have different lengths, although they are identical. This perception is caused by the different three-dimensional interpretation of the junctions at the line endings. The left line is interpreted as the convex edge of two meeting walls, whereas the right line appears to be a concave corner. Part (b) of the figure shows the horizontal-vertical illusion. The vertical line appears to be longer than the horizontal one, although both have the same length. In Part (c), the Ebbinghaus-Titchener illusion is shown. The perceived size of the central circle depends on the size of the black circles surrounding it.

Contextual effects of brightness perception are illustrated by the Munker-White illusion [224], shown in Figure 1.6. Two gray diagonals are partially occluded by a black-and-white pattern of horizontal stripes.
The perceived brightness of the diagonals is very different, although they have the same reflectance. This illustrates that