
MultiSmolVLA: Improving the Multi-Sensor Robustness of VLA Models Through Modality Dropout Training

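The MultiSmolVLA project combines the 4M-21 multimodal encoder with SmolVLA and introduces a modality dropout training strategy, significantly improving the robustness of vision-language-action models under sensor failure and offering a more reliable perception approach for robotic applications.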

Tags: MultiSmolVLA, VLA models, multimodal perception, robotics, modality dropout, robustness, 4M-21, SmolVLA, EPFL, vision-language-action

Published 2026/04/22 17:43 | Last activity 2026/04/22 17:51 | Estimated reading time 4 minutes

Section 01

MultiSmolVLA: Enhancing VLA Model Robustness for Robots via Modality Dropout Training

EPFL's MultiSmolVLA project addresses the fragility of single-RGB VLA models in real-world robot scenarios. It combines the 4M-21 multi-modal encoder with SmolVLA and introduces a modality dropout training strategy to boost robustness against sensor failures, aiming to provide more reliable perception solutions for robot applications.

Section 02

The Vulnerability of Single-Modal VLA Models in Real-World Deployment

Current VLA models such as π0 and OpenVLA rely on RGB input for perception, so their performance degrades in real deployments when sensors fail (hardware faults), the environment interferes (glare, smoke), or the view is occluded. Any of these can cause catastrophic task failures and pose safety risks for robots.

Section 03

Key Innovations of MultiSmolVLA: Architecture & Training Strategy

Architecture: Replaces SmolVLA's SigLIP encoder with the 4M-21 multi-modal encoder from EPFL and Apple, which fuses RGB, depth, semantic segmentation, and thermal modalities into a unified token sequence.

Training: Uses a progressive modality dropout curriculum. The dropout rate is zero during the connector alignment phase, then increases linearly to 0.5 during robustness fine-tuning, so that the model learns to cope with missing modalities.
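
The dropout curriculum described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the function names `dropout_prob` and `drop_modalities`, the choice to drop each modality independently, and the rule of always keeping at least one modality are all assumptions. The source only states that dropout is zero during connector alignment and then rises linearly to 0.5.

```python
import random

# Modality names taken from the article; the token contents below are placeholders.
MODALITIES = ("rgb", "depth", "segmentation", "thermal")


def dropout_prob(step, total_steps, p_max=0.5):
    """Linear ramp from 0 to p_max over the fine-tuning stage (assumed schedule shape)."""
    if total_steps <= 1:
        return p_max
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return p_max * frac


def drop_modalities(tokens_by_modality, p, rng):
    """Drop each modality independently with probability p.

    Assumption: at least one modality is always kept, so the model
    never receives an empty input.
    """
    kept = {
        name: tokens
        for name, tokens in tokens_by_modality.items()
        if rng.random() >= p
    }
    if not kept:
        name = rng.choice(sorted(tokens_by_modality))
        kept = {name: tokens_by_modality[name]}
    return kept


if __name__ == "__main__":
    rng = random.Random(0)
    sample = {name: [f"{name}_tok{i}" for i in range(2)] for name in MODALITIES}
    for step in (0, 500, 999):
        p = dropout_prob(step, total_steps=1000)
        print(step, round(p, 3), sorted(drop_modalities(sample, p, rng)))
```

During the connector alignment phase the same masking function would simply be called with `p = 0`, which keeps every modality.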

Section 04

Technical Implementation: Data Synthesis, Training Flow & Evaluation Setup

Thermal Data: Uses ThermalGen, a diffusion model, to synthesize thermal images from RGB; the results are converted via ImageBind into 4M-21-compatible embeddings.

Two-Stage Training: 1) Train an MLP connector that maps 4M-21 outputs into the SmolLM2 embedding space, with no dropout; 2) LoRA fine-tune SmolLM2 and the action expert with increasing dropout.

Dataset & Eval: Uses the LIBERO benchmark (4 task categories: Spatial, Object, Goal, Long) under 3 test conditions: clean, hard dropout, and soft corruption.
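
The article names the three test conditions without defining them. The sketch below shows one plausible reading, stated as an assumption: "hard dropout" removes a modality entirely, while "soft corruption" keeps it but degrades it with additive Gaussian noise. The function name `apply_condition`, the `noise_std` parameter, and the representation of an observation as a dictionary of float lists are hypothetical.

```python
import random

CONDITIONS = ("clean", "hard_dropout", "soft_corruption")


def apply_condition(obs, condition, target, rng, noise_std=0.1):
    """Return a copy of obs (modality name -> list of floats) under a test condition.

    Assumed semantics, not confirmed by the source:
      clean           -> observation is returned unchanged
      hard_dropout    -> the target modality is removed entirely
      soft_corruption -> Gaussian noise is added to the target modality
    """
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition}")
    out = {name: list(values) for name, values in obs.items()}
    if condition == "hard_dropout":
        out.pop(target, None)
    elif condition == "soft_corruption" and target in out:
        out[target] = [v + rng.gauss(0.0, noise_std) for v in out[target]]
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    obs = {"rgb": [0.1, 0.2], "depth": [0.5, 0.6], "thermal": [0.3, 0.4]}
    for condition in CONDITIONS:
        print(condition, apply_condition(obs, condition, "thermal", rng))
```

Evaluating the same policy under all three conditions, one target modality at a time, would give a per-modality picture of how gracefully performance degrades.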

Section 05

Performance Comparisons & Ablation Analysis

Baseline results: vanilla SmolVLA reaches 87.3% average task completion and vanilla π0 reaches 86%. Full MultiSmolVLA results are not disclosed, but the ablation studies examine: 1) the impact of additional modalities versus RGB only; 2) the effectiveness of curriculum dropout versus fixed dropout.

Section 06

Technical Significance & Real-World Applications

Contributions: 1) Shifts the focus to "performance under failure" for VLA models; 2) Demonstrates assembly-style innovation (combining existing components); 3) Curriculum dropout could also be applied to medical imaging and autonomous driving; 4) Code and evaluation are open-sourced for the community.

Section 07

Limitations & Future Research Directions

Limitations: Synthetic thermal data may differ from real thermal imagery; computation cost is higher than with single RGB input; sensor synchronization is not discussed. Future: Explore efficient fusion (cross-attention) and adaptive modality selection, and extend robustness to adversarial attacks and calibration errors.