FoodSense: A Multimodal Dataset and Benchmark Model for Predicting Multisensory Experiences from Food Images

This article introduces the FoodSense dataset, which contains 66,842 human-annotated entries supporting the prediction of taste, smell, texture, and sound from food images, and trains the FoodSense-VL vision-language model to enable multisensory reasoning.

multisensory perception · food image understanding · vision-language model · cross-modal reasoning · FoodSense · cognitive science · multimodal dataset
Published 2026-04-16 04:02 · Recent activity 2026-04-20 10:18 · Estimated read 5 min

Section 01

[Introduction] FoodSense: Innovative Research Connecting Food Images and Multisensory Experiences

This article introduces the FoodSense dataset (66,842 human-annotated entries covering four sensory dimensions: taste, smell, texture, and sound), aiming to fill a gap in AI food understanding: the lack of deeper cognitive awareness of sensory experience. It trains the FoodSense-VL vision-language model to enable multisensory reasoning, and discusses application scenarios and the work's significance for cognitive science.


Section 02

[Background] Cognitive Science of Cross-Sensory Perception and Limitations of Existing Research

Humans can evoke multisensory experiences from food images alone (what cognitive science calls cross-sensory perception), yet current AI food research remains limited to recognition tasks such as dish classification, ingredient identification, and nutrition estimation. Lacking any deeper model of food sensory experience, its understanding of food stays superficial.


Section 03

[Method] Construction Details of the FoodSense Dataset

The FoodSense dataset contains 66,842 participant-image pairs covering 2,987 unique food images. Each annotation combines numerical scores (a 1-5 Likert scale quantifying sensory intensity) with free-text descriptions (capturing subtler experiences) across the four sensory dimensions, and the images span diverse cultures and cooking styles to support the model's generalization.
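
The article does not spell out a record schema; the sketch below shows one hypothetical way a single participant-image pair could be represented in Python, with illustrative field names and values.

```python
from dataclasses import dataclass, field
from typing import Dict

SENSORY_DIMENSIONS = ("taste", "smell", "texture", "sound")

# Hypothetical schema for one participant-image annotation pair;
# field names are illustrative, not the dataset's actual format.
@dataclass
class FoodSenseRecord:
    image_id: str                                                # one of the 2,987 unique food images
    participant_id: str                                          # anonymized annotator identifier
    scores: Dict[str, int] = field(default_factory=dict)         # 1-5 Likert score per dimension
    descriptions: Dict[str, str] = field(default_factory=dict)   # free-text note per dimension

record = FoodSenseRecord(
    image_id="img_00421",
    participant_id="p_0073",
    scores={"taste": 4, "smell": 5, "texture": 3, "sound": 2},
    descriptions={
        "taste": "rich caramel sweetness with a slightly bitter edge",
        "smell": "toasted sugar and browned butter",
        "texture": "crisp shell, soft center",
        "sound": "a light crack when broken",
    },
)
```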


Section 04

[Method] Inference Trajectory Generation: From Annotations to Explainable AI

Large language models expand the short annotations into image-anchored inference trajectories that explain the basis of each sensory prediction (e.g., inferring a caramelized aroma from a caramelized color). Because these trajectories are tied to visible content, they provide richer training signals for the model and aid explainability.
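
To make the expansion step concrete, here is a small hypothetical sketch of how a short annotation could be turned into a prompt for an arbitrary text-generation model; the function names, prompt wording, and the stand-in `llm` callable are illustrative, not the authors' actual pipeline.

```python
from typing import Callable

def build_trajectory_prompt(dimension: str, score: int, note: str) -> str:
    """Assemble a prompt asking an LLM to expand a short sensory annotation
    into an inference trajectory anchored in visible image evidence."""
    return (
        f'A food image was rated {score}/5 for {dimension}, with the note: "{note}".\n'
        "Explain step by step which visible cues in the image (color, surface, char, "
        "glaze, steam, etc.) support this rating, and state the inference explicitly, "
        "e.g. 'the deep brown, glossy surface suggests caramelization, which implies "
        "a caramelized aroma'."
    )

def expand_annotation(dimension: str, score: int, note: str,
                      llm: Callable[[str], str]) -> str:
    """Run any text-generation callable over the assembled prompt."""
    return llm(build_trajectory_prompt(dimension, score, note))

# Stand-in "LLM" that simply echoes the prompt, to keep the sketch self-contained.
print(expand_annotation("smell", 5, "toasted sugar", llm=lambda p: p))
```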


Section 05

[Method] FoodSense-VL Model: A Multisensory Vision-Language Benchmark Model

FoodSense-VL adopts an end-to-end vision-language architecture with two training objectives: score prediction (a regression task mapping visual features to sensory intensity) and explanation generation (conditional text generation integrating visual evidence with sensory knowledge). The two tasks reinforce each other: generating explanations improves score accuracy, while score prediction keeps explanations concrete and grounded.
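
As a rough illustration of how such dual objectives can be combined, the PyTorch sketch below places a regression head and a language-model head over a backbone's hidden states and sums the two losses; the dimensions, head structure, and weighting `alpha` are assumptions, not the paper's reported configuration.

```python
from torch import nn

class MultisensoryHeads(nn.Module):
    """Two task heads over a vision-language backbone's hidden states:
    a regression head for 1-5 sensory intensities and a language-model head
    for explanation tokens. All sizes here are placeholders."""

    def __init__(self, hidden_dim: int, vocab_size: int, num_senses: int = 4):
        super().__init__()
        self.score_head = nn.Linear(hidden_dim, num_senses)   # taste, smell, texture, sound
        self.lm_head = nn.Linear(hidden_dim, vocab_size)       # token logits for explanations
        self.mse = nn.MSELoss()
        self.ce = nn.CrossEntropyLoss(ignore_index=-100)       # -100 masks padding tokens

    def forward(self, pooled, token_states, target_scores, target_tokens, alpha=1.0):
        # Score prediction: map pooled visual-language features to sensory intensities.
        score_loss = self.mse(self.score_head(pooled), target_scores)
        # Explanation generation: next-token cross-entropy over the trajectory text.
        logits = self.lm_head(token_states)                     # (batch, seq, vocab)
        gen_loss = self.ce(logits.flatten(0, 1), target_tokens.flatten())
        # Joint objective: the two tasks regularize each other.
        return score_loss + alpha * gen_loss
```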


Section 06

[Evaluation] Reflection on Evaluation Metrics for Sensory Reasoning Tasks

Traditional captioning metrics (BLEU, CIDEr) reward surface-level text overlap and ignore whether sensory descriptions are accurate and consistent with the image. The article suggests that future evaluations instead focus on description-image consistency, the accuracy of the sensory attributes mentioned, and the soundness of the inference process.
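
As a toy illustration of evaluation oriented toward sensory accuracy rather than n-gram overlap, the sketch below checks whether human-annotated attribute terms appear in a generated description and computes the error against human Likert scores; both functions are hypothetical proxies, not metrics proposed in the article.

```python
from typing import Dict, List

def sensory_attribute_accuracy(generated: str, reference_attributes: List[str]) -> float:
    """Fraction of human-annotated attribute terms that appear in the generated
    description; a crude proxy for sensory-attribute accuracy."""
    text = generated.lower()
    hits = sum(1 for attr in reference_attributes if attr.lower() in text)
    return hits / len(reference_attributes) if reference_attributes else 0.0

def score_mae(pred: Dict[str, float], ref: Dict[str, float]) -> float:
    """Mean absolute error between predicted and human Likert scores."""
    keys = pred.keys() & ref.keys()
    return sum(abs(pred[k] - ref[k]) for k in keys) / len(keys)

print(sensory_attribute_accuracy(
    "A glossy, caramelized crust suggests a sweet, toasty aroma and a crisp bite.",
    ["caramelized", "crisp", "sweet"]))                                   # -> 1.0
print(score_mae({"taste": 4.2, "smell": 4.8}, {"taste": 4, "smell": 5}))  # -> ~0.2
```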


Section 07

[Applications and Outlook] Potential Value and Future Directions of FoodSense

Application scenarios include intelligent catering recommendations (based on sensory preferences), virtual taste testing (enhancing immersion), food marketing (generating appealing descriptions), and dietary health management (combining nutrition with sensory preferences). By bridging cognitive science and AI, this research points toward future AI that approaches human-level food understanding.