Human visual recognition is remarkably robust. People can recognize thousands of object classes in the blink of an eye (50–200 ms), even when the objects vary in position, scale, viewpoint, and illumination. What aspects of human category learning facilitate the extraction of invariant visual features for object recognition? Here, we explore the possibility that one contributing factor to learning such robust visual representations is a taxonomic hierarchy communicated in part by the common labels to which people are exposed as part of natural language. We tested this by manipulating the taxonomic level of the labels (superordinate-level [mammal, fruit, vehicle] vs. basic-level [dog, banana, van]) and the order in which a convolutional neural network (CNN) was trained with them. We found that training the model with hierarchical labels yields visual representations that are more robust to image transformations (e.g., changes in position/scale, illumination, noise, and blur), especially when the model was first trained with superordinate labels and then fine-tuned with basic-level labels. We also found that superordinate-label followed by basic-label training best predicts functional magnetic resonance imaging (fMRI) responses in visual cortex and behavioral similarity judgments recorded while participants viewed naturalistic images. The benefits of training with superordinate labels in the earlier stages of category learning are discussed in the context of representational efficiency and generalization.
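The two-stage labeling curriculum described above (superordinate labels first, then basic-level fine-tuning) can be sketched as follows, assuming a PyTorch-style setup. The backbone choice (resnet18), the class counts, the basic-to-superordinate mapping (SUPER_OF), and the toy random data are illustrative assumptions for exposition, not the authors' actual training code.

# Sketch of the superordinate-then-basic label curriculum:
# train on coarse labels first, then fine-tune the same backbone
# on basic-level labels. All names and data here are illustrative.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import resnet18

NUM_BASIC = 10   # e.g., dog, banana, van, ...
NUM_SUPER = 3    # e.g., mammal, fruit, vehicle
# Hypothetical many-to-one mapping from basic-level class index
# to its superordinate class index.
SUPER_OF = torch.tensor([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])

# Toy stand-in data; in the study this would be labeled natural images.
images = torch.randn(32, 3, 64, 64)
basic_labels = torch.randint(0, NUM_BASIC, (32,))
loader = DataLoader(TensorDataset(images, basic_labels), batch_size=8)

def train(model, loader, label_fn, epochs=1):
    # Fresh optimizer per stage, so the new classifier head is included.
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), label_fn(y))
            loss.backward()
            opt.step()

# Stage 1: classify images at the superordinate level.
model = resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, NUM_SUPER)
train(model, loader, label_fn=lambda y: SUPER_OF[y])

# Stage 2: swap the classifier head and fine-tune on basic-level
# labels, keeping the features learned under superordinate training.
model.fc = nn.Linear(model.fc.in_features, NUM_BASIC)
train(model, loader, label_fn=lambda y: y)

Replacing only the classifier head between stages leaves the convolutional features learned under superordinate supervision intact at the start of fine-tuning, which is the property the fine-tuning condition in the study depends on.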
Ahn, S., Zelinsky, G. J., & Lupyan, G. (2021). Use of superordinate labels yields more robust and human-like visual representations in convolutional neural networks. Journal of Vision, 21(13), 1–19. https://doi.org/10.1167/JOV.21.13.13