SIMMER: A New Method for Cross-Modal Retrieval Between Food Images and Recipes Based on Multimodal Large Language Models

Cross-modal retrieval · Multimodal large language models · Food images · Recipe recommendation · SIMMER · Unified encoder · VLM2Vec

Published 2026-04-17 10:09 | Recent activity 2026-04-20 10:20 | Estimated read 6 min

Section 01

Introduction: SIMMER, a New Method for Cross-Modal Retrieval Between Food Images and Recipes

The paper proposes the SIMMER framework, which replaces the traditional dual-encoder architecture with a single multimodal encoder. On the Recipe1M dataset (1k setting), it raises image-to-recipe retrieval R@1 from 81.8% to 87.5%. The method targets two weaknesses of traditional cross-modal retrieval, the semantic gap between modalities and the need for task-specific designs, and offers a new paradigm for retrieval between food images and recipe texts.

Section 02

Background: Application Value of Cross-Modal Retrieval and Limitations of Traditional Methods

Application Value of Cross-Modal Retrieval

Cross-modal retrieval between food images and recipe texts serves everyday needs such as dish replication, nutrition management, and cooking assistance. Examples include photographing ingredients to find matching recipes, and intelligent menu management for catering enterprises.

Limitations of Traditional Dual-Encoder Architecture

Semantic Gap: Independent image and text encoders make it difficult to unify the representation space;
Task-Specific Design: Requires customizing networks for different tasks, leading to high development costs;
Insufficient Fine-Grained Association: Difficult to capture detailed matches such as ingredients and cooking methods.

Section 03

Core Innovation of SIMMER: Single Unified Encoder Architecture

SIMMER (Single Integrated Multimodal Model for Embedding Recipes) uses VLM2Vec as its base multimodal large language model. Food images are encoded into visual tokens, which are input together with recipe text tokens into a single encoder that generates unified embedding vectors. Because both modalities pass through the same model and land in the same representation space, the design is intended to remove the semantic gap that dual encoders introduce.
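To make the architectural point concrete, here is a minimal toy sketch, not the SIMMER or VLM2Vec implementation. It assumes that, at retrieval time, each side is embedded on its own by the same model. The class `ToyUnifiedEncoder`, the hashed token vectors, and the token names are all hypothetical stand-ins; the only property illustrated is that one function with one set of parameters produces every embedding, so image and recipe vectors are directly comparable.

```python
import hashlib
import math
from typing import List, Sequence

DIM = 16  # toy embedding size (assumption, chosen only for illustration)


class ToyUnifiedEncoder:
    """Toy stand-in for a multimodal LLM: one encoder for every input type."""

    def _token_vector(self, token: str) -> List[float]:
        # Deterministic pseudo-embedding derived from a hash of the token.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        return [b / 255.0 - 0.5 for b in digest[:DIM]]

    def encode(self, tokens: Sequence[str]) -> List[float]:
        """Sum-pool the token vectors, then L2-normalize the result."""
        pooled = [0.0] * DIM
        for tok in tokens:
            for j, v in enumerate(self._token_vector(tok)):
                pooled[j] += v
        norm = math.sqrt(sum(v * v for v in pooled)) or 1.0
        return [v / norm for v in pooled]


encoder = ToyUnifiedEncoder()
# Hypothetical visual tokens and text tokens go through the same encode() call.
image_emb = encoder.encode(["<patch_0>", "<patch_1>", "<patch_2>"])
recipe_emb = encoder.encode(["Title:", "pancakes", "Ingredients:", "flour"])
similarity = sum(a * b for a, b in zip(image_emb, recipe_emb))
```

In a dual-encoder system there would be two such classes with separate parameters, and an additional training objective would have to align their output spaces.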

Section 04

Structured Prompt Design for Recipe Structure

Recipes consist of three core components: title, ingredients, and steps. SIMMER designs specialized prompt templates:

Image input prompts guide attention to visual features (color, texture, shape) and cooking methods;
Text input prompts clearly separate the title, ingredients, and steps, helping the model understand the recipe structure and generate semantically richer embeddings.
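The structured text prompt can be sketched as a small template function. The instruction strings and section labels below are hypothetical, since the exact SIMMER prompts are not given here; the sketch only shows the idea of rendering each recipe component under its own labeled section.

```python
from typing import Optional, Sequence

# Hypothetical instruction strings, not the actual SIMMER prompts.
IMAGE_INSTRUCTION = (
    "Represent this food image for recipe retrieval, focusing on color, "
    "texture, shape, and cooking method."
)
RECIPE_INSTRUCTION = "Represent this recipe for food image retrieval."


def build_recipe_prompt(
    title: Optional[str] = None,
    ingredients: Optional[Sequence[str]] = None,
    steps: Optional[Sequence[str]] = None,
) -> str:
    """Render whichever recipe components are present under labeled sections."""
    parts = [RECIPE_INSTRUCTION]
    if title:
        parts.append(f"Title: {title}")
    if ingredients:
        parts.append("Ingredients:\n" + "\n".join(f"- {item}" for item in ingredients))
    if steps:
        numbered = "\n".join(f"{n}. {step}" for n, step in enumerate(steps, start=1))
        parts.append("Instructions:\n" + numbered)
    return "\n\n".join(parts)


prompt = build_recipe_prompt(
    title="Pancakes",
    ingredients=["flour", "milk"],
    steps=["Mix the batter.", "Fry until golden."],
)
```

Because missing components are simply skipped, the same function also handles partial recipes such as a title with no steps.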

Section 05

Component-Aware Data Augmentation Strategy

To improve robustness to incomplete inputs, SIMMER uses component-aware augmentation. During training, the model sees both complete recipes and various partial combinations (title only, title + ingredients, etc.). This teaches it to extract semantics from limited fragments, so it can handle the incomplete recipe information that is common in real-world applications.
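A minimal sketch of this kind of augmentation follows. The list of component subsets and the uniform sampling are assumptions: the text names only "title only" and "title + ingredients" explicitly, and it does not say how combinations are weighted. `sample_recipe_view` is a hypothetical helper name.

```python
import random
from typing import Dict, List, Tuple

# Assumed component combinations; only the first three are named or implied
# by the text, the last is added here purely for illustration.
COMPONENT_SUBSETS: List[Tuple[str, ...]] = [
    ("title", "ingredients", "steps"),  # complete recipe
    ("title",),
    ("title", "ingredients"),
    ("ingredients", "steps"),
]


def sample_recipe_view(recipe: Dict[str, object], rng: random.Random) -> Dict[str, object]:
    """Return a copy of the recipe restricted to one randomly chosen subset."""
    keep = rng.choice(COMPONENT_SUBSETS)
    return {name: recipe[name] for name in keep if name in recipe}


recipe = {
    "title": "Pancakes",
    "ingredients": ["flour", "milk", "egg"],
    "steps": ["Mix the batter.", "Fry until golden."],
}
rng = random.Random(0)  # seeded so the sketch is reproducible
views = [sample_recipe_view(recipe, rng) for _ in range(50)]
```

Each training step would then embed one such view instead of always embedding the full recipe, paired with the same food image.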

Section 06

Experimental Evidence: Significant Performance Improvement on the Recipe1M Dataset

In the evaluation on the Recipe1M dataset:

1k setting: Image-to-recipe retrieval R@1 reaches 87.5%, an improvement of 5.7 percentage points over the previous best;
10k setting: R@1 jumps from 56.5% to 65.5%, an improvement of 9 percentage points;
All metrics surpass the baseline, supporting the case for combining a single-encoder architecture with a multimodal large language model.
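For readers unfamiliar with the metric, R@K (Recall@K) is the fraction of queries whose correct match appears among the top K retrieved candidates. In Recipe1M evaluations, the 1k and 10k settings conventionally refer to the number of image-recipe pairs in the candidate pool, which is why the 10k scores are lower. The sketch below is a plain-Python illustration under the assumption that image i's ground-truth recipe is recipe i; the toy vectors are made up and are not results from the paper.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def recall_at_k(
    image_embs: Sequence[Sequence[float]],
    recipe_embs: Sequence[Sequence[float]],
    k: int = 1,
) -> float:
    """Image-to-recipe Recall@K, assuming image i matches recipe i."""
    hits = 0
    for i, query in enumerate(image_embs):
        sims = [cosine(query, cand) for cand in recipe_embs]
        # Rank of the true recipe = number of candidates scoring strictly higher.
        rank = sum(1 for s in sims if s > sims[i])
        if rank < k:
            hits += 1
    return hits / len(image_embs)


# Made-up 2-D embeddings for three image-recipe pairs.
images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
recipes = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.7]]
r_at_1 = recall_at_k(images, recipes, k=1)
r_at_2 = recall_at_k(images, recipes, k=2)
```

In this toy pool the second image ranks its own recipe second, so it counts as a miss at K=1 and a hit at K=2.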

Section 07

Conclusion and Application Prospects

Technical Insights

The unified encoder architecture eliminates semantic gaps and can be extended to other cross-modal tasks;
Structured prompts improve performance in specific domains;
Component-aware augmentation enhances robustness in practical applications.

Application Scenarios

Smart kitchen assistants, catering nutrition analysis, social media food discovery, intelligent management for catering enterprises, etc.

Conclusion

SIMMER represents an important advance in food cross-modal retrieval and lays a foundation for practical applications. In the future, it may support more intelligent human-computer interaction services in everyday cooking and dining.