Image-recipe retrieval, which aims at retrieving the relevant recipe given a food image and vice versa, is attracting widespread attention, since sharing food-related images and recipes on the Internet has become a popular trend. Existing methods formulate this problem as a typical cross-modal retrieval task by learning image-recipe similarity. Although these methods have achieved inspiring results for image-recipe retrieval, they may still be less effective at jointly incorporating three crucial points: (1) the association between ingredients and instructions, (2) fine-grained image information, and (3) the latent alignment between recipes and images. To this end, we propose a novel framework named Hybrid Fusion with Intra- and Cross-Modality Attention (HF-ICMA) to learn accurate image-recipe similarity. Our HF-ICMA model adopts an intra-recipe fusion module that focuses on the interaction between ingredients and instructions within a recipe, thereby enriching the two separate embeddings. Meanwhile, an image-recipe fusion module is devised to explore the potential relationship between fine-grained image regions and ingredients from the recipe; together, these modules form the final image-recipe similarity from both local and global aspects. Extensive experiments on the large-scale benchmark dataset Recipe1M show that our model significantly outperforms state-of-the-art approaches in various image-recipe retrieval scenarios.
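The two attention mechanisms the abstract describes can be summarized in a minimal sketch. Assuming PyTorch and plain scaled dot-product attention (both assumptions; the paper's actual architecture, dimensions, and module names may differ), the intra-recipe module lets ingredient embeddings attend to instruction embeddings, while the cross-modality module lets ingredients attend to fine-grained image-region features before scoring similarity:

# A minimal sketch, not the authors' code. Module and argument names,
# dimensions, and the use of scaled dot-product attention are
# illustrative assumptions based only on the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

def attend(query, key, value):
    # Standard scaled dot-product attention over the key/value sequence.
    scores = query @ key.transpose(-2, -1) / key.size(-1) ** 0.5
    weights = F.softmax(scores, dim=-1)
    return weights @ value

class HybridFusionSketch(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def intra_recipe(self, ingredients, instructions):
        # Intra-recipe fusion: enrich ingredient embeddings with
        # instruction context via a residual attention step.
        q = self.q_proj(ingredients)   # (B, n_ingr, dim)
        k = self.k_proj(instructions)  # (B, n_inst, dim)
        v = self.v_proj(instructions)
        return ingredients + attend(q, k, v)

    def cross_modal_similarity(self, regions, ingredients):
        # Image-recipe fusion: each ingredient attends to image regions;
        # a global score averages the resulting local similarities.
        q = self.q_proj(ingredients)   # (B, n_ingr, dim)
        k = self.k_proj(regions)       # (B, n_reg, dim)
        v = self.v_proj(regions)
        attended = attend(q, k, v)     # region context per ingredient
        return F.cosine_similarity(attended, ingredients, dim=-1).mean(dim=-1)

A symmetric call with the roles of ingredients and instructions swapped would enrich the instruction embeddings in the same way; how the paper actually combines the local and global scores is not specified in the abstract.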
CITATION STYLE
Li, J., Xu, X., Yu, W., Shen, F., Cao, Z., Zuo, K., & Shen, H. T. (2021). Hybrid Fusion with Intra- and Cross-Modality Attention for Image-Recipe Retrieval. In SIGIR 2021 - Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 244–254). Association for Computing Machinery. https://doi.org/10.1145/3404835.3462965