Project Architecture Overview
This project uses a multimodal fusion architecture that combines three data modalities (dish images, ingredient text, and dish weight) in a single prediction framework. The design is modular and extensible: each modality has its own feature extraction path, and a fusion layer combines the resulting features.
Main components:
- dataset.py: Handles data configuration, data frame preparation, FastText text encoding, image transformations, and the Dataset/DataLoader implementation
- utils.py: Defines the model architecture, training loop, validation logic, inference interface, and error analysis tools
- sprint_4.ipynb: Notebook for EDA, model experiments, training, and result visualization
Multimodal Feature Extraction Mechanism
Visual Modality: Image Feature Extraction
Uses pre-trained models from the timm library to extract features from dish images. These features capture visual cues such as appearance, color, and texture, which help identify the dish type and ingredient proportions.
Text Modality: Ingredient Description Encoding
Uses a FastText model to convert ingredient descriptions into sentence vectors. FastText's subword information lets it handle out-of-vocabulary words, and the resulting vectors capture semantic relationships that carry information about ingredient types, cooking methods, and similar attributes.
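To make the subword idea concrete, here is a small pure-Python sketch of how a word can be split into character n-grams with boundary markers, in the style described for FastText. It is an illustration only, not the library's implementation: the function name char_ngrams is hypothetical, and the default range of 3 to 6 characters follows the original FastText paper rather than this project's configuration.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in '<' and '>' markers."""
    token = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n across the marked-up token.
        for start in range(len(token) - n + 1):
            grams.append(token[start:start + n])
    return grams


# Trigrams only, to keep the output short.
print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']
```

Because an unseen ingredient name still shares n-grams with known words, a vector for it can be composed from those shared pieces instead of falling back to a single unknown token.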
Numerical Modality: Weight Information Processing
Passes the dish's total weight through a separate lightweight encoder as a direct numerical feature. This complements the visual and text features and addresses the case where similar-looking dishes differ in calories because their weights differ.
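The three paths described above could be combined roughly as in the following PyTorch sketch: a small encoder for the weight, concatenation with precomputed image and text features, and a regression head. Everything specific here is an assumption for illustration, including the class name FusionRegressor, the feature sizes (512 for images, 300 for text, 16 for weight), the hidden size, and the choice of a single calorie value as output. The actual model in utils.py may differ.

```python
import torch
from torch import nn


class FusionRegressor(nn.Module):
    """Hypothetical late-fusion head over image features, text features, and weight."""

    def __init__(self, image_dim=512, text_dim=300, weight_dim=16, hidden_dim=128):
        super().__init__()
        # Lightweight encoder that lifts the scalar weight into a small vector.
        self.weight_encoder = nn.Sequential(nn.Linear(1, weight_dim), nn.ReLU())
        # Fusion layer: concatenated features -> hidden layer -> one predicted value.
        self.head = nn.Sequential(
            nn.Linear(image_dim + text_dim + weight_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, image_feat, text_feat, weight):
        weight_feat = self.weight_encoder(weight)  # weight has shape (batch, 1)
        fused = torch.cat([image_feat, text_feat, weight_feat], dim=1)
        return self.head(fused).squeeze(1)  # shape (batch,)


model = FusionRegressor()
batch = 4
# Random placeholders standing in for real image features, text vectors, and weights.
pred = model(torch.randn(batch, 512), torch.randn(batch, 300), torch.rand(batch, 1))
print(pred.shape)  # torch.Size([4])
```

Keeping the weight in its own encoder, rather than appending the raw scalar, gives it a representation of comparable scale to the other features before fusion; in practice the raw weight would usually also be normalized first.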