IM2City: Image Geo-localization via Multi-modal Learning

Abstract

This study investigated multi-modal learning as a stand-alone solution to image geo-localization problems. Building on successful trials with the contrastive language-image pre-training (CLIP) model, we developed GEo-localization Multi-modal (GEM) models, which not only learn visual features from input images but also integrate the labels with their corresponding geo-location context to generate textual features, which are in turn fused with the visual features for image geo-localization. We demonstrated that using only the image itself and appropriate contextualized prompts (i.e., mechanisms that integrate labels with geo-location context as textual features) is effective for global image geo-localization, a task that traditionally requires large amounts of geo-tagged images for image matching. Moreover, owing to the integration of natural language, our GEM models are able to learn the spatial proximity of geo-contextualized labels (i.e., their spatial closeness), which is often neglected by classification-based geo-localization methods. In particular, the proposed Zero-shot GEM model (i.e., geo-contextualized prompt tuning on CLIP) outperforms the state-of-the-art model, Individual Scene Networks (ISN), obtaining 4.1% and 49.5% accuracy improvements on the two benchmark datasets, IM2GPS3k and Place Plus 2.0 (i.e., 22k street view images across 56 cities worldwide), respectively. In addition, our proposed Linear-probing GEM model (i.e., CLIP's image encoder linearly trained on street view images) outperforms ISN even more significantly, obtaining 16.8% and 71.0% accuracy improvements, respectively. By exploring optimal geographic scales (e.g., city-level vs. country-level), training datasets (street view images vs. random online images), and pre-trained models (e.g., ResNet vs. CLIP for linear probing), this research sheds light on integrating textual features with visual features for image geo-localization and beyond.
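The zero-shot idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt template, the function names (`build_prompt`, `zero_shot_predict`), the temperature value, and the three-dimensional toy vectors are all assumptions made for illustration. In a real system the vectors would come from CLIP's image and text encoders; here they are hand-written stand-ins so the example runs without any model.

```python
import math

def build_prompt(city, country):
    # Hypothetical template: the label (city) combined with its
    # geo-location context (country) in one natural-language prompt.
    return f"A photo taken in {city}, {country}."

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def zero_shot_predict(image_vec, text_vecs, temperature=0.01):
    # Score the image against every prompt embedding, then apply a
    # temperature-scaled softmax (max subtracted for numerical stability).
    sims = [cosine(image_vec, t) for t in text_vecs]
    top = max(sims)
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

labels = [("Paris", "France"), ("Tokyo", "Japan"), ("Lyon", "France")]
prompts = [build_prompt(city, country) for city, country in labels]

# Toy stand-ins for encoder outputs (assumed values, not real CLIP features).
# The two French prompts are deliberately placed close together.
text_vecs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.9, 0.3, 0.1]]
image_vec = [0.95, 0.15, 0.05]

best, probs = zero_shot_predict(image_vec, text_vecs)
print(prompts[best])
for prompt, p in zip(prompts, probs):
    print(f"{p:.3f}  {prompt}")
```

In this toy setup the prompt for the second French city receives more probability mass than the prompt for the distant city, which mirrors, in a very simplified way, the abstract's point that language-based labels can carry spatial proximity that a plain one-hot classifier would ignore.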

Cite

Wu, M., & Huang, Q. (2022). IM2City: Image Geo-localization via Multi-modal Learning. In Proceedings of the 5th ACM SIGSPATIAL International Workshop on AI for Geographic Knowledge Discovery, GeoAI 2022 (pp. 50–61). Association for Computing Machinery, Inc. https://doi.org/10.1145/3557918.3565868
