Zing Forum

FoodSense: A Multimodal Dataset and Benchmark Model for Predicting Multisensory Experiences from Food Images

FoodSense has built a dataset with annotations from 66,842 participants, enabling AI to predict taste, smell, texture, and sound from food images and generate visually-based explainable reasoning.

Tags: Cross-sensory reasoning · Food image understanding · Vision-language models · Multimodal datasets · Cognitive science
Published 2026-04-16 04:02 · Recent activity 2026-04-17 10:21 · Estimated read: 5 min

Section 01

[Introduction] FoodSense Project: A Breakthrough in Enabling AI to Perceive Multisensory Experiences from Food Images

The FoodSense project aims to address the gap in AI cross-sensory reasoning. It has built a dataset with annotations from 66,842 participants, covering 2,987 food images, supporting the prediction of taste, smell, texture, and sound from visuals, and generating explainable reasoning. The trained FoodSense-VL model advances food image understanding from surface-level recognition to multisensory perception, bridging cognitive science and AI.


Section 02

Background: The Cognitive Gap Between Human Cross-Sensory Perception and AI

When humans see food images, they associate multi-dimensional sensory experiences (e.g., the crispness and aroma of pizza), but current AI recognizes only surface-level semantics (e.g., "this is pizza") and cannot perceive sensory characteristics, which limits applications such as food recommendation. The FoodSense project was created to close this gap.


Section 03

FoodSense Dataset: A Large-Scale Resource with Multisensory Annotations

The dataset contains 66,842 participant-image pairs and 2,987 images, with annotations in four dimensions:

  • Taste: 1-5 rating + free description (e.g., "sweet with a hint of sour");
  • Smell: aroma characteristics + intensity rating (e.g., "toasted bread's burnt aroma");
  • Texture: visually inferable attributes (e.g., "crisp", "soft");
  • Sound: imagined eating sounds (e.g., "the crunch of potato chips").
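The four annotation dimensions above can be pictured as a single record per participant-image pair. The sketch below is hypothetical: field names and value ranges are illustrative assumptions, not the dataset's actual release format.

```python
from dataclasses import dataclass, field

@dataclass
class SensoryAnnotation:
    """One participant's multisensory annotation for one food image.

    Field names are illustrative; the released FoodSense format may differ.
    """
    image_id: str
    participant_id: str
    taste_rating: int            # 1-5 scale, per the dataset description
    taste_description: str = ""  # free text, e.g. "sweet with a hint of sour"
    smell_description: str = ""  # aroma characteristics
    smell_intensity: int = 3     # intensity rating (assumed 1-5 range)
    texture_attributes: list = field(default_factory=list)  # e.g. ["crisp", "soft"]
    sound_description: str = ""  # imagined eating sound

    def __post_init__(self):
        if not 1 <= self.taste_rating <= 5:
            raise ValueError("taste_rating must be in 1..5")

# Example record for a plate of potato chips
record = SensoryAnnotation(
    image_id="img_0001",
    participant_id="p_42",
    taste_rating=4,
    taste_description="salty, slightly oily",
    smell_description="fried potato aroma",
    smell_intensity=3,
    texture_attributes=["crisp"],
    sound_description="the crunch of potato chips",
)
```

Keeping all four senses in one record makes it easy to train the multitask model described later from a single data loader.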

Section 04

Methodology: Data Augmentation from Annotations to Visual Reasoning Chains

The project uses large language models to expand short annotations into image-grounded reasoning chains. For the fried chicken example: "Golden crispy exterior → porous structure from high-temperature frying → crispy texture + crunch sound; golden color → Maillard reaction → burnt aroma and umami taste..." This connects cognitive-science insight with instruction fine-tuning, providing richer training signals.
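The expansion step above amounts to prompting an LLM with the short annotations and asking for a "visual cue → physical cause → sensory experience" chain. The template below is a hypothetical sketch of such a prompt; the paper's actual prompts are not reproduced here, and the LLM call itself is left out.

```python
def build_expansion_prompt(food_name, annotations):
    """Format a prompt asking an LLM to expand short sensory annotations
    into a visual reasoning chain (illustrative template, not the paper's).
    """
    notes = "; ".join(f"{dim}: {text}" for dim, text in annotations.items())
    return (
        f"Food: {food_name}\n"
        f"Annotations: {notes}\n"
        "Task: explain, step by step, which visible cues in the image "
        "support each annotation, in the form "
        "'visual cue -> physical cause -> sensory experience'."
    )

prompt = build_expansion_prompt(
    "fried chicken",
    {"texture": "crispy", "sound": "crunch", "smell": "burnt aroma"},
)
```

The resulting text pairs (image, expanded chain) can then serve directly as instruction-tuning examples.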


Section 05

FoodSense-VL Model: Multitask Learning and Explainable Reasoning

Model innovations:

  • Multitask learning: shared encoder + task heads to learn sensory correlations (e.g., crispy appearance → crunch sound);
  • Explanation generation: natural language descriptions of the visual basis for predictions;
  • Fine-grained perception: attention mechanism maps image regions to sensory attributes (e.g., texture → consistency).
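The shared-encoder-plus-task-heads design above can be sketched in a few lines. This is a minimal pure-Python toy (random toy layers instead of a vision-language backbone, and made-up head sizes) meant only to show the shape of the architecture, not FoodSense-VL's actual implementation.

```python
import random

def linear(in_dim, out_dim, seed):
    """A toy linear layer: a fixed random weight matrix, no bias."""
    rng = random.Random(seed)
    w = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    def apply(x):
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
    return apply

class MultitaskFoodModel:
    """Shared encoder + one head per sense, so all tasks see the same
    representation and can exploit cross-sensory correlations."""
    def __init__(self, feat_dim=8, shared_dim=4):
        self.encoder = linear(feat_dim, shared_dim, seed=0)
        self.heads = {
            "taste": linear(shared_dim, 5, seed=1),    # 1-5 rating logits
            "smell": linear(shared_dim, 5, seed=2),    # intensity logits
            "texture": linear(shared_dim, 3, seed=3),  # assumed classes, e.g. crisp/soft/chewy
            "sound": linear(shared_dim, 3, seed=4),    # assumed classes, e.g. crunch/squish/quiet
        }

    def forward(self, image_features):
        z = self.encoder(image_features)  # shared representation
        return {task: head(z) for task, head in self.heads.items()}

model = MultitaskFoodModel()
out = model.forward([0.5] * 8)
```

Because every head reads the same shared representation, gradients from the sound task can shape features that also help the texture task, which is how a crispy appearance comes to predict a crunch sound.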

Section 06

Evaluation Reflection: Limitations of Traditional Metrics

Traditional vision-language metrics (e.g., semantic-correctness scores) cannot capture the subtleties of sensory experience: they may score "crisp and delicious" as equivalent to "crisp outside and tender inside", even though the two describe different sensations. This motivates the development of perception-sensitive evaluation metrics.
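As a toy illustration of the problem, a surface token-overlap F1 (a stand-in for string-matching metrics, not any metric from the paper) rewards shared words regardless of which sensory attributes actually match:

```python
def token_f1(prediction, reference):
    """Token-overlap F1 between two descriptions: a simple surface
    metric that is blind to which sensory attributes agree."""
    p = set(prediction.lower().split())
    r = set(reference.lower().split())
    overlap = len(p & r)
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(p), overlap / len(r)
    return 2 * prec * rec / (prec + rec)

# Different sensations, yet shared surface tokens ("crisp", "and")
# earn the pair substantial credit.
score = token_f1("crisp and delicious", "crisp outside and tender inside")
```

A perception-sensitive metric would instead need to compare the sensory attributes the descriptions assert, not their word overlap.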


Section 07

Application Scenarios and Future Directions

Applications: intelligent recommendation (matching taste preferences), virtual tasting (sensory description), cooking assistance (dish development), and accessibility (sensory descriptions for visually impaired users). Limitations and future work: the annotations reflect cultural differences in taste perception, and static images struggle to convey dynamic eating experiences; future work should expand cultural diversity, incorporate video, and link predictions to chemical composition.


Section 08

Conclusion: A Bridge Between Cognitive Science and AI

FoodSense encodes human cross-sensory perception in a multimodal model, advancing food image understanding from "what it is" to "how it feels", an important step toward human-like intelligence.