Abstract
Fusion of multi-modality images integrates complementary information from different sensors, creating a richer and more comprehensive representation. Traditional fusion methods, which rely on element-wise addition or feature-channel concatenation, often fail to fully fuse crucial information. To address these limitations, we propose a novel model based on vision transformers and an adaptive feature fusion network. Our model includes a multi-level feature decoupling layer that separates global and modality-specific features, combined with an attention-based adaptive dynamic fusion strategy. This strategy dynamically weights features according to their importance, enabling effective cross-modal fusion. Extensive experiments demonstrate our model's superior performance, particularly in infrared-visible fusion, with significant improvements in metrics such as mutual information (MI). Our approach not only preserves information from the source images but also produces fused images with high contrast and clear texture details. The results indicate the potential of our model in various applications, including military surveillance, remote sensing, and object detection. The code is available at https://github.com/jiejie2-code/ADF.git.
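As a concrete illustration of the attention-based dynamic weighting described in the abstract, below is a minimal PyTorch sketch of what such a fusion block could look like. This is not the authors' implementation (see their repository for that); the module name `AdaptiveFusion`, the two-modality inputs `feat_ir`/`feat_vis`, and the convolutional attention head are all assumptions made for illustration. The key idea it demonstrates is replacing fixed element-wise addition or concatenation with a learned, per-location convex combination of the two feature maps.

```python
import torch
import torch.nn as nn


class AdaptiveFusion(nn.Module):
    """Hypothetical attention-based adaptive fusion block (names assumed).

    Given feature maps from two modalities (e.g. infrared and visible),
    a small attention head predicts a per-pixel weight map, and the
    features are blended as a convex combination rather than fused by
    fixed addition or channel concatenation.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Attention head: sees both modalities, outputs one weight map in (0, 1).
        self.attn = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feat_ir: torch.Tensor, feat_vis: torch.Tensor) -> torch.Tensor:
        # w has shape (B, 1, H, W) and is broadcast over the channel dimension.
        w = self.attn(torch.cat([feat_ir, feat_vis], dim=1))
        return w * feat_ir + (1.0 - w) * feat_vis


if __name__ == "__main__":
    fuse = AdaptiveFusion(channels=64)
    ir = torch.randn(1, 64, 32, 32)   # infrared feature map
    vis = torch.randn(1, 64, 32, 32)  # visible feature map
    print(fuse(ir, vis).shape)        # torch.Size([1, 64, 32, 32])
```

The MI metric highlighted in the abstract is conventionally computed from the joint gray-level histogram of the fused image with each source image; for fusion quality one reports the sum MI(fused, ir) + MI(fused, vis). A small NumPy sketch of the standard histogram-based estimator (the function name `mutual_information` is our own):

```python
import numpy as np


def mutual_information(img_a: np.ndarray, img_b: np.ndarray, bins: int = 256) -> float:
    """MI between two grayscale images, estimated from their joint histogram."""
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    pxy = joint / joint.sum()                 # joint probability
    px = pxy.sum(axis=1, keepdims=True)       # marginal of img_a
    py = pxy.sum(axis=0, keepdims=True)       # marginal of img_b
    nz = pxy > 0                              # avoid log(0)
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))
```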
Citation
Xiao, W., Chen, J., Pan, C., Wang, T., & Jiang, L. (2025). Adaptive dynamic fusion of multi-modality features for enhanced image representation. Visual Computer, 41(12), 10055–10067. https://doi.org/10.1007/s00371-025-04021-5