# VLM2VLA and Catastrophic Forgetting: Research on Knowledge Retention of Vision-Language Models in Autonomous Driving

> A study addressing the catastrophic forgetting issue of vision-language models (VLMs) during fine-tuning for autonomous driving. By representing driving actions as natural language, it achieves lightweight fine-tuning using only LoRA, enabling the model to gain action capabilities while preserving its general reasoning abilities.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-09T08:07:53.000Z
- Last activity: 2026-05-09T08:28:23.209Z
- Popularity: 141.7
- Keywords: catastrophic forgetting, vision-language models, autonomous driving, LoRA fine-tuning, VLM2VLA, action representation, knowledge retention, transfer learning
- Page URL: https://www.zingnex.cn/en/forum/thread/vlm2vla
- Canonical: https://www.zingnex.cn/forum/thread/vlm2vla
- Markdown source: floors_fallback

---

## Overview

This study focuses on the catastrophic forgetting problem of vision-language models (VLMs) during fine-tuning for autonomous driving. The core innovation is representing low-level driving actions as natural language descriptions instead of traditional numerical labels, and using LoRA for lightweight fine-tuning. This approach enables the model to acquire driving action prediction capabilities while effectively preserving its general reasoning, semantic understanding, and language abilities, providing a new idea for VLA model training in the autonomous driving field.

## Research Background and Problem Definition

Vision-language models (VLMs) excel in general visual understanding and natural language reasoning, but when fine-tuned for autonomous driving action prediction, they suffer from **catastrophic forgetting**—the model loses general reasoning, semantic understanding, and language abilities while learning to generate driving actions. Existing mainstream VLA models (e.g., EMMA, OpenDriveVLA) use numerical labels + full fine-tuning, leading to severe forgetting; dual-system approaches (e.g., Senna) still require full fine-tuning and only moderately alleviate forgetting.

## Core Innovations and System Architecture

### Core Innovations
Extend the VLM2VLA paradigm by representing driving actions as natural language (e.g., "Decelerate to 30 km/h, maintain current lane...") instead of traditional numerical labels (e.g., `<waypoint:0.23,-0.11,0.87>`). Advantages include distribution consistency, lightweight fine-tuning, and knowledge retention.
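To make the linguification idea concrete, the sketch below renders a numerical action tuple as a templated sentence. The function name, the action fields, and the exact phrasing are illustrative assumptions, not the paper's actual templates:

```python
def action_to_language(target_speed_kmh: float, lane_change: int) -> str:
    """Render a low-level driving action as a natural-language instruction.

    lane_change: -1 = move to the left lane, 0 = keep lane, +1 = move right.
    (Hypothetical action schema for illustration only.)
    """
    speed_part = f"Adjust speed to {target_speed_kmh:.0f} km/h"
    lane_part = {
        -1: "change to the left lane",
        0: "maintain the current lane",
        1: "change to the right lane",
    }[lane_change]
    return f"{speed_part}, {lane_part}."

description = action_to_language(30, 0)
# "Adjust speed to 30 km/h, maintain the current lane."
```

Because the output lives in the same token distribution the VLM was pre-trained on, the fine-tuning signal stays close to the model's existing language modeling objective, which is the claimed source of the knowledge-retention benefit.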

### System Architecture
1. **VLM Backbone**: Use open-source VLMs like Gemma-3/LLaVA, fine-tune only via LoRA adapters while keeping original parameters unchanged.
2. **Action Linguification Module**: Convert numerical actions into natural language to connect driving data with the VLM.
3. **Lightweight Action Decoder**: Convert natural language into control commands such as waypoints/trajectories; trained independently without affecting the VLM backbone.
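The LoRA mechanism behind step 1 can be sketched as a frozen weight matrix plus a trainable low-rank update. This is a generic NumPy illustration of the technique, not the project's training code:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0, r=4):
    """y = x W^T + (alpha/r) * x A^T B^T.

    W (d_out x d_in) is the frozen pretrained weight; only the
    low-rank factors A (r x d_in) and B (d_out x r) are trained.
    """
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 16, 4
W = rng.normal(size=(d_out, d_in))   # frozen backbone weight
A = rng.normal(size=(r, d_in))       # trainable down-projection
B = np.zeros((d_out, r))             # zero-init: the update starts at 0
x = rng.normal(size=(2, d_in))

# With B = 0, the adapted layer reproduces the frozen model exactly,
# so fine-tuning starts from the pretrained behaviour.
base = x @ W.T
adapted = lora_forward(x, W, A, B, r=r)
```

Only the factors A and B (2·r·d parameters per adapted layer) receive gradients, so the backbone's pretrained weights are never overwritten; this is the mechanism the study relies on to limit forgetting.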

## Experimental Design and Evaluation Framework

### Dataset
Experiments use nuScenes (multimodal driving scenes) and the Waymo Open Dataset (large-scale, high-quality driving data).

### Evaluation Metrics
- **Driving Performance**: L2 displacement error, collision rate, route completion rate.
- **General Ability Retention**: MMMU (multimodal reasoning), MMStar (vision-language benchmark), VQA benchmarks, compared with the original VLM before fine-tuning.
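The L2 displacement metric listed above is the mean Euclidean distance between predicted and ground-truth trajectory waypoints. A minimal sketch of the standard definition (function name is illustrative):

```python
import numpy as np

def l2_displacement_error(pred, gt):
    """Mean L2 distance between predicted and ground-truth waypoints,
    each given as an array of shape (T, 2) in metres."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    return float(np.linalg.norm(pred - gt, axis=1).mean())

# A prediction offset by a 3-4-5 triangle at the first waypoint and
# exact at the second averages to (5 + 0) / 2 = 2.5 m.
err = l2_displacement_error([[3.0, 4.0], [1.0, 1.0]],
                            [[0.0, 0.0], [1.0, 1.0]])
```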

### Ablation Experiments
Three configurations are designed to verify component contributions:
| Configuration | Action Format | Fine-tuning Method |
|---|---|---|
| Baseline | Numerical Label | Full Fine-tuning |
| Ablation 1 | Numerical Label | LoRA |
| Ablation 2 (This Project) | Natural Language | LoRA |

## Method Comparison and Advantages

| Method | Action Format | Fine-tuning Method | Degree of Catastrophic Forgetting |
|---|---|---|---|
| Standard VLA (EMMA, OpenDriveVLA) | Numerical labels | Full fine-tuning | Severe |
| Dual-system VLA (Senna) | Mixed format | Full fine-tuning | Moderate |
| This project | Natural language | LoRA only | Minimal |

By changing the action representation and fine-tuning strategy, this scheme achieves task adaptation while preserving the model's general capabilities.

## Research Significance and Outlook

### Research Significance
1. **Autonomous Driving Field**: Provides a new idea for VLA training without sacrificing general capabilities.
2. **Catastrophic Forgetting Research**: Alleviates mismatch from the perspective of data distribution and provides a practical solution.
3. **Lightweight Fine-tuning**: Verifies the feasibility of LoRA in complex autonomous driving scenarios.

### Outlook
This study provides insights for neural network transfer learning: changing data representation to match the pre-training distribution can alleviate catastrophic forgetting. As large models are increasingly applied in vertical fields, such research will become more important.
