
Disentangling scene content from spatial boundary: complementary roles for the parahippocampal place area and lateral occipital complex in representing real-world scenes.

by Soojin Park, Timothy F Brady, Michelle R Greene, Aude Oliva
Journal of Neuroscience (2011)

Abstract

Behavioral and computational studies suggest that visual scene analysis rapidly produces a rich description of both the objects and the spatial layout of surfaces in a scene. However, there is still a large gap in our understanding of how the human brain accomplishes these diverse functions of scene understanding. Here we probe the nature of real-world scene representations using multivoxel functional magnetic resonance imaging pattern analysis. We show that natural scenes are analyzed in a distributed and complementary manner by the parahippocampal place area (PPA) and the lateral occipital complex (LOC) in particular, as well as other regions in the ventral stream. Specifically, we study the classification performance of different scene-selective regions using images that vary in spatial boundary and naturalness content. We discover that, whereas both the PPA and LOC can accurately classify scenes, they make different errors: the PPA more often confuses scenes that have the same spatial boundaries, whereas the LOC more often confuses scenes that have the same content. By demonstrating that visual scene analysis recruits distinct and complementary high-level representations, our results testify to distinct neural pathways for representing the spatial boundaries and content of a visual scene.
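The error analysis summarized above can be illustrated with a minimal Python sketch. It assumes a 4 x 4 confusion matrix of classifier counts over the four scene groups (rows are true groups, columns are predicted groups) and splits the misclassifications by whether the confused groups share a spatial boundary or share content. The function name `error_breakdown`, the group ordering, and the example counts are hypothetical and chosen only for illustration; they are not the authors' code or data.

```python
import numpy as np

# Four scene groups, each defined by (spatial boundary, content).
# This ordering is an assumption made for the sketch.
GROUPS = [
    ("open", "natural"),
    ("closed", "natural"),
    ("open", "urban"),
    ("closed", "urban"),
]


def error_breakdown(confusion):
    """Return the fraction of misclassifications of each error type.

    confusion[i][j] is the number of samples of true group i that were
    classified as group j. Diagonal entries (correct answers) are ignored.
    """
    confusion = np.asarray(confusion, dtype=float)
    counts = {"same_boundary": 0.0, "same_content": 0.0, "neither": 0.0}
    for i, (true_boundary, true_content) in enumerate(GROUPS):
        for j, (pred_boundary, pred_content) in enumerate(GROUPS):
            if i == j:
                continue  # correct classification, not an error
            if true_boundary == pred_boundary:
                counts["same_boundary"] += confusion[i, j]
            elif true_content == pred_content:
                counts["same_content"] += confusion[i, j]
            else:
                counts["neither"] += confusion[i, j]
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: float(value / total) for name, value in counts.items()}


# Made-up counts in which most errors stay within the same spatial boundary.
example = [
    [20, 2, 6, 0],
    [2, 20, 0, 6],
    [6, 0, 20, 2],
    [0, 6, 2, 20],
]
print(error_breakdown(example))
```

For the made-up matrix, three quarters of the errors fall in the same-boundary category, which is the kind of profile the abstract attributes to the PPA; a region whose errors fall mostly in the same-content category would match the profile attributed to the LOC.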

Available from www.ncbi.nlm.nih.gov
Behavioral/Systems/Cognitive

Disentangling Scene Content from Spatial Boundary: Complementary Roles for the Parahippocampal Place Area and Lateral Occipital Complex in Representing Real-World Scenes

Soojin Park, Timothy F. Brady, Michelle R. Greene, and Aude Oliva
Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139

Introduction

Behavioral studies have shown that, in a brief glance at a scene, a rich representation is built comprising spatial layout, the degree of human manufacture, and a few prominent objects (Oliva and Torralba, 2001; Renninger and Malik, 2004; Fei-Fei et al., 2007; Greene and Oliva, 2009a,b).
In parallel, neuroimaging investigations have identified specific brain regions involved in scene perception. Among these regions are the parahippocampal place area (PPA), which responds preferentially to pictures of scenes and landmarks and shows selectivity to the geometric layout of the scene but not the quantity of objects (Aguirre et al., 1998; Epstein and Kanwisher, 1998; Epstein et al., 1999; Janzen and van Turennout, 2004), and the retrosplenial complex (RSC), which also responds to scenes and navigationally relevant tasks (Epstein, 2008; Park and Chun, 2009) in addition to processing context (Bar and Aminoff, 2003). In contrast, the lateral occipital complex (LOC) has been found to represent object shapes and categories (Malach et al., 1995; Grill-Spector et al., 1998; Kourtzi and Kanwisher, 2000; Eger et al., 2008; Vinberg and Grill-Spector, 2008). Recent studies have found that activity in early visual areas, PPA, and LOC is discriminative enough to allow scene classification into a handful of semantic categories (Naselaris et al., 2009; Walther et al., 2009). However, we do not yet know how the brain accomplishes the diverse functions involved in scene understanding.

Here, we examined scene representation using functional magnetic resonance imaging (fMRI) pattern analysis, demonstrating the existence of high-level neural representations of visual environments that uncouple processing of the spatial boundaries of a scene from its content. Just as external shape and internal features are separable dimensions of face encoding, an environmental space can be represented by two separable and complementary descriptors (Oliva and Torralba, 2001): its spatial boundary (i.e., the shape and size of the scene's space) and its content (textures, surfaces, materials, and objects).
As illustrated in Figure 1, the shape of a scene may be expansive and open to the horizon, as in a field or highway, or closed and bounded by frontal and lateral surfaces, as in forests or streets. For a given spatial boundary, a scene may comprise natural or urban (manufactured) objects. Analyzing the types of errors produced by the PPA and LOC in this two-dimensional space allows us to distinguish whether the PPA and LOC represent a scene in an overlapping manner (e.g., both produce similar errors when classifying scenes) or in a complementary manner (e.g., specialization in representing boundaries and content).

We show that, although both the PPA and the LOC classify scenes with the same level of accuracy, these regions show opposite patterns of classification errors. Therefore, our work provides the first evidence that multiple brain regions perform distinct and complementary analysis of a visual scene, similar in spirit to that proposed by computational models of scene understanding (Oliva and Torralba, 2001; Vogel and Schiele, 2007; Greene and Oliva, 2009a) and visual search (Torralba, 2003; Torralba et al., 2006).

Received July 26, 2010; revised Oct. 26, 2010; accepted Oct. 30, 2010.
This work was funded by National Science Foundation Graduate Research Fellowships (T.F.B., M.R.G.) and National Science Foundation CAREER Award IIS 0546262 (A.O.). We thank the Athinoula A. Martinos Center at the McGovern Institute for Brain Research, Massachusetts Institute of Technology for data acquisition, and Talia Konkle and Barbara Hidalgo-Sotelo for helpful conversation and comments on this manuscript.
Correspondence should be addressed to either Soojin Park or Aude Oliva, Department of Brain and Cognitive Sciences, Room 46-4065, 77 Massachusetts Avenue, Massachusetts Institute of Technology, Cambridge, MA 02139. E-mail: sjpark31@mit.edu, oliva@mit.edu.
DOI:10.1523/JNEUROSCI.3885-10.2011
Copyright © 2011 the authors 0270-6474/11/311333-08$15.00/0
The Journal of Neuroscience, January 26, 2011 • 31(4):1333–1340

Materials and Methods

Subjects. Eight participants (two females; one left-handed; ages, 19–28 years) for the main experiment, six participants (three females; ages, 18–29 years) for the first control experiment (using phase-scrambled images), and seven participants (three females; ages, 20–35 years) for the second control experiment (using added vertical and horizontal bars) were recruited from the Massachusetts Institute of Technology community for financial compensation. All had normal or corrected-to-normal vision. Informed consent was obtained, and the study protocol was approved by the Institutional Review Board of the Massachusetts Institute of Technology. One participant for the main experiment was excluded from the analyses because of excessive head movement (over 8 mm across runs).

Visual stimuli. Scenes were carefully chosen to represent each of the following four scene groups: “open natural” images, “closed natural” images, “open urban” images, and “closed urban” images (Oliva and Torralba, 2001; Greene and Oliva, 2009a). Images were visually matched for spatial boundary and content across these groups (see examples in Figs. 1, 3a). Importantly, each scene group included multiple basic-level scene categories. For example, open and closed natural images included different views of fields, oceans, forests, creeks, mountains, and deserts, whereas open and closed urban images included views of highways, parking lots, streets, city canals, buildings, and airports. There were 140 test images per scene group.
In the main experiment, photographs were 256 × 256 pixels in resolution (4.5° × 4.5° of visual angle) and were presented in grayscale with a mean luminance averaging 127 (on a 0–255 luminance scale). In the first control experiment, the same images were phase scrambled, so as to keep second-order image statistics but remove high-level scene information. The grayscale images from experiment 1 were first Fourier transformed to decompose them into their amplitude spectrum and phase. Next, the phase at each frequency was replaced with a random phase. Finally, the image was reconstructed from this modified Fourier space and then rescaled so the luminance of each pixel ranged from 0 to 255, and the overall image had mean luminance of 127. In the second control experiment, horizontal or vertical lines were superposed on top of the images, which were presented at 500 × 500 pixels resolution to maximize the visibility of the lines (for control experiment stimuli, see Fig. 6). Lines of the same orientation were added on top of either all of the natural scenes or all of the urban scenes to increase the within-content low-level image similarity across these two sets of conditions. Images were presented in the scanner using a Hitachi (CP-X1200 series) rear-projection screen.

Experimental design. Twenty images from an image group were presented in blocks of 20 s each. The order of block conditions was randomized within each run. Each block was followed by a 10 s fixation period. Within a block, each scene was displayed for 800 ms, followed by a 200 ms blank. The entire image set (560 images) was presented across two runs with a break between: the first run was composed of 16 blocks with four blocks per condition, acquiring 245 image volumes; the second run was composed of 12 blocks with three blocks per condition, acquiring 185 image volumes. This set of two runs was repeated four times within a session, totaling eight runs, to increase the number of samples and power.
Accordingly, participants saw the same image four times across runs, at a different time point in each run. Twenty-eight blocks per condition were acquired and used as training samples throughout the experiment. Participants performed a one-back repetition detection task to maintain attention. The experimental design for the first and second control experiments was identical to that of the main experiment, except that participants in the first control experiment performed a red-frame detection task rather than a one-back repetition task to maintain attention on the phase-scrambled images.

MRI acquisition and preprocessing. Imaging data were acquired with a 3 T Siemens fMRI scanner with a 32-channel phased-array head coil (Siemens) at the Martinos Center at the McGovern Institute for Brain Research at Massachusetts Institute of Technology. Anatomical images were acquired using a high-resolution (1 × 1 × 1 mm voxel) magnetization-prepared rapid-acquisition gradient echo structural scan. Functional images were acquired with a gradient echo-planar T2* sequence [repetition time (TR), 2 s; echo time, 30 ms; field of view, 200 mm; 64 × 64 matrix; flip angle, 90°; in-plane resolution, 3.1 × 3.1 × 3.1 mm; 33 axial 3.1 mm slices with no gap acquired parallel to the anterior commissure–posterior commissure line].

Figure 1. A schematic illustration of how pictures of real-world scenes can be uniquely defined by their spatial boundary information and content. Note that the configuration, size, and locations of components corresponded between natural and urban environments. a, Keeping the enclosed spatial boundary, if we strip off the natural content of a forest and fill the space with urban contents, then the scene becomes an urban street scene. b, Keeping the open spatial boundary, if we strip off the natural content of a field and fill the space with urban contents, then the scene becomes an urban parking lot.
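The phase-scrambling procedure described under Visual stimuli can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `phase_scramble` is hypothetical, the random phase is taken from the Fourier transform of a white-noise image (one common way to keep the reconstructed image real-valued), and the final rescaling is one possible reading of the text, fixing the mean at 127 and stretching the values as far as the 0 to 255 range allows.

```python
import numpy as np


def phase_scramble(image, rng=None):
    """Return a phase-scrambled copy of a 2-D grayscale image.

    The amplitude spectrum of the input is kept and the phase at each
    frequency is replaced with a random phase.
    """
    rng = np.random.default_rng() if rng is None else rng
    image = np.asarray(image, dtype=float)

    # Decompose the image into amplitude and phase; keep only the amplitude.
    amplitude = np.abs(np.fft.fft2(image))

    # Assumption: draw the random phase from a real white-noise image so the
    # modified spectrum stays conjugate-symmetric and inverts to a real image.
    noise = rng.standard_normal(image.shape)
    random_phase = np.angle(np.fft.fft2(noise))

    # Reconstruct the image from the original amplitude and the random phase.
    scrambled = np.real(np.fft.ifft2(amplitude * np.exp(1j * random_phase)))

    # Assumption: center on a mean of 127, then scale so that all pixels fit
    # within [0, 255] and at least one pixel reaches a limit of that range.
    centered = scrambled - scrambled.mean()
    low, high = centered.min(), centered.max()
    if high - low == 0:
        return np.full(image.shape, 127.0)  # flat image: nothing to scale
    scale = min(127.0 / -low, 128.0 / high)
    return 127.0 + centered * scale


# Usage with a synthetic 256 x 256 image standing in for a scene photograph.
demo_rng = np.random.default_rng(0)
photo = demo_rng.integers(0, 256, size=(256, 256)).astype(float)
scrambled_photo = phase_scramble(photo, rng=demo_rng)
```

Because the final step is a linear rescaling, the amplitude spectrum of the output matches that of the input up to a constant factor at every nonzero frequency, which is what preserves the second-order image statistics while removing recognizable scene structure.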
