VLM2VLA and Catastrophic Forgetting: Research on Knowledge Retention of Vision-Language Models in Autonomous Driving

A study addressing the catastrophic forgetting issue of vision-language models (VLMs) during fine-tuning for autonomous driving. By representing driving actions as natural language, it achieves lightweight fine-tuning using only LoRA, enabling the model to gain action capabilities while preserving its general reasoning abilities.

Tags: Catastrophic Forgetting, Vision-Language Models, Autonomous Driving, LoRA Fine-tuning, VLM2VLA, Action Representation, Knowledge Retention, Transfer Learning
Published 2026-05-09 16:07 · Recent activity 2026-05-09 16:28 · Estimated read 7 min

Section 01

[Overview] VLM2VLA and Catastrophic Forgetting: Research on Knowledge Retention of Vision-Language Models in Autonomous Driving

This study focuses on the catastrophic forgetting problem of vision-language models (VLMs) fine-tuned for autonomous driving. The core innovation is to represent low-level driving actions as natural-language descriptions instead of traditional numerical labels, and to fine-tune with lightweight LoRA adapters. This lets the model acquire driving-action prediction capabilities while effectively preserving its general reasoning, semantic understanding, and language abilities, offering a new approach to training vision-language-action (VLA) models for autonomous driving.


Section 02

Research Background and Problem Definition

Vision-language models (VLMs) excel at general visual understanding and natural-language reasoning, but when fine-tuned for autonomous-driving action prediction they suffer from catastrophic forgetting: while learning to generate driving actions, the model loses its general reasoning, semantic understanding, and language abilities. Mainstream VLA models (e.g., EMMA, OpenDriveVLA) use numerical action labels with full fine-tuning, which causes severe forgetting; dual-system approaches (e.g., Senna) still require full fine-tuning and only moderately alleviate it.


Section 03

Core Innovations and System Architecture

Core Innovations

Extend the VLM2VLA paradigm by representing driving actions as natural language (e.g., "Decelerate to 30 km/h, maintain current lane...") instead of traditional numerical labels (e.g., <waypoint:0.23,-0.11,0.87>). The advantages are distribution consistency (language-form actions stay close to the VLM's pre-training distribution), lightweight fine-tuning, and knowledge retention.
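The mapping from numerical actions to language could look like the sketch below. The function name, the action fields (current/target speed, a lane-change flag), and the template wording are illustrative assumptions, not the paper's actual implementation.

```python
def linguify_action(current_kmh: float, target_kmh: float, lane_change: int) -> str:
    """Render a low-level driving action as a natural-language command.

    lane_change: -1 = change left, 0 = keep lane, +1 = change right
    (an assumed encoding for this sketch).
    """
    if target_kmh > current_kmh:
        speed_part = f"Accelerate to {target_kmh:g} km/h"
    elif target_kmh < current_kmh:
        speed_part = f"Decelerate to {target_kmh:g} km/h"
    else:
        speed_part = f"Maintain speed at {target_kmh:g} km/h"
    lane_part = {0: "maintain current lane",
                 -1: "change to the left lane",
                 1: "change to the right lane"}[lane_change]
    return f"{speed_part}, {lane_part}."

print(linguify_action(50, 30, 0))  # Decelerate to 30 km/h, maintain current lane.
```

Because the output is ordinary text, it can be used directly as a supervision target for the VLM without introducing new action tokens.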

System Architecture

  1. VLM Backbone: an open-source VLM such as Gemma-3 or LLaVA, fine-tuned only through LoRA adapters while the original parameters stay frozen.
  2. Action Linguification Module: converts numerical actions into natural language, bridging driving data and the VLM.
  3. Lightweight Action Decoder: converts natural language back into control commands such as waypoints/trajectories; trained independently so it does not affect the VLM backbone.
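The decoder side of the pipeline can be sketched as a small parser from language back to numbers. The regex, the output fields, and the `decode_action` name are assumptions for illustration; the paper's decoder is a trained module, not a rule-based parser.

```python
import re

def decode_action(text: str) -> dict:
    """Parse a natural-language action into a numeric control command (toy sketch)."""
    m = re.search(
        r"(accelerate|decelerate|maintain speed) (?:to|at) (\d+(?:\.\d+)?) km/h",
        text, re.IGNORECASE,
    )
    if not m:
        raise ValueError(f"unparseable action: {text!r}")
    return {
        "target_speed_kmh": float(m.group(2)),
        "keep_lane": "maintain current lane" in text.lower(),
    }

cmd = decode_action("Decelerate to 30 km/h, maintain current lane.")
print(cmd)  # {'target_speed_kmh': 30.0, 'keep_lane': True}
```

Keeping this stage separate from the VLM is what allows the backbone's weights (and hence its general knowledge) to remain untouched by the control-specific training signal.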

Section 04

Experimental Design and Evaluation Framework

Dataset

The study uses nuScenes (multimodal driving scenes) and the Waymo Open Dataset (large-scale, high-quality driving data).

Evaluation Metrics

  • Driving Performance: L2 displacement error, collision rate, route completion rate.
  • General Ability Retention: MMMU (multimodal reasoning), MMStar (vision-language benchmark), VQA benchmarks, compared with the original VLM before fine-tuning.
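The first driving metric above, L2 displacement error, is the mean Euclidean distance between predicted and ground-truth trajectory waypoints. The sketch below assumes a simple average over the horizon; benchmark implementations differ in averaging convention (per-timestep vs. per-horizon).

```python
import math

def l2_displacement_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth waypoints."""
    assert len(pred) == len(gt), "trajectories must share a horizon"
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(dists) / len(dists)

pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.1)]
gt = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.0)]
print(round(l2_displacement_error(pred, gt), 4))  # 0.0667
```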

Ablation Experiments

Three configurations are designed to verify component contributions:

| Configuration | Action Format | Fine-tuning Method |
| --- | --- | --- |
| Baseline | Numerical labels | Full fine-tuning |
| Ablation 1 | Numerical labels | LoRA |
| Ablation 2 (this project) | Natural language | LoRA |

Section 05

Method Comparison and Advantages

| Method | Action Format | Fine-tuning Method | Catastrophic Forgetting |
| --- | --- | --- | --- |
| Standard VLA (EMMA, OpenDriveVLA) | Numerical labels | Full fine-tuning | Severe |
| Dual-system VLA (Senna) | Mixed format | Full fine-tuning | Moderate |
| This project | Natural language | LoRA only | Minimal |

By changing the action representation and fine-tuning strategy, this scheme achieves task adaptation while preserving the model's general capabilities.


Section 06

Research Significance and Outlook

Research Significance

  1. Autonomous Driving Field: offers a way to train VLA models without sacrificing general capabilities.
  2. Catastrophic Forgetting Research: frames forgetting as a data-distribution mismatch and provides a practical, representation-level remedy.
  3. Lightweight Fine-tuning: verifies the feasibility of LoRA in complex autonomous-driving scenarios.
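The LoRA mechanism whose feasibility the study relies on can be sketched in a few lines: the frozen weight W is augmented with a scaled low-rank residual (alpha / r) * B @ A, and only the small factors A and B are trained. The toy dimensions and pure-Python matrices below are illustrative, not the paper's setup.

```python
def matvec(M, v):
    """Plain matrix-vector product for small lists-of-lists."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

d_out, d_in, r, alpha = 3, 3, 1, 2
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen pretrained weight
A = [[0.1, 0.2, 0.3]]      # trainable (r x d_in), small random init in practice
B = [[0.0], [0.0], [0.0]]  # trainable (d_out x r), zero init

def lora_forward(x):
    # Frozen path plus scaled low-rank residual; with B = 0 the residual
    # vanishes, so fine-tuning starts exactly at the pretrained model.
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + (alpha / r) * l for b, l in zip(base, low_rank)]

x = [1.0, 2.0, 3.0]
print(lora_forward(x))  # [1.0, 2.0, 3.0] while B is zero-initialized
```

Because W never changes, whatever knowledge the pretrained VLM encodes in W is retained by construction; only the residual adapts to the driving task.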

Outlook

This study offers a broader insight for transfer learning in neural networks: re-encoding task data to match the pre-training distribution can alleviate catastrophic forgetting. As large models are applied to more vertical domains, this line of research will only grow in importance.