VLM2VLA and Catastrophic Forgetting: Research on Knowledge Retention of Vision-Language Models in Autonomous Driving

A study addressing the catastrophic forgetting issue of vision-language models (VLMs) during fine-tuning for autonomous driving. By representing driving actions as natural language, it achieves lightweight fine-tuning using only LoRA, enabling the model to gain action capabilities while preserving its general reasoning abilities.

Tags: Catastrophic Forgetting, Vision-Language Models, Autonomous Driving, LoRA Fine-tuning, VLM2VLA, Action Representation, Knowledge Retention, Transfer Learning
Published 2026-05-09 16:07 · Recent activity 2026-05-09 16:28 · Estimated read 7 min

Section 01

[Overview] VLM2VLA and Catastrophic Forgetting: Research on Knowledge Retention of Vision-Language Models in Autonomous Driving

This study focuses on the catastrophic forgetting problem of vision-language models (VLMs) fine-tuned for autonomous driving. The core innovation is to represent low-level driving actions as natural-language descriptions instead of traditional numerical labels, and to fine-tune with lightweight LoRA adapters. This lets the model acquire driving-action prediction capabilities while effectively preserving its general reasoning, semantic understanding, and language abilities, offering a new approach to training vision-language-action (VLA) models for autonomous driving.


Section 02

Research Background and Problem Definition

Vision-language models (VLMs) excel at general visual understanding and natural-language reasoning, but when fine-tuned for autonomous-driving action prediction they suffer from catastrophic forgetting: while learning to generate driving actions, the model loses its general reasoning, semantic understanding, and language abilities. Mainstream VLA models (e.g., EMMA, OpenDriveVLA) use numerical action labels with full fine-tuning, which causes severe forgetting; dual-system approaches (e.g., Senna) still require full fine-tuning and only moderately alleviate it.


Section 03

Core Innovations and System Architecture

Core Innovations

Extend the VLM2VLA paradigm by representing driving actions as natural language (e.g., "Decelerate to 30 km/h, maintain current lane...") instead of traditional numerical labels (e.g., <waypoint:0.23,-0.11,0.87>). The advantages are distribution consistency (language-form actions stay close to the VLM's pre-training distribution), lightweight fine-tuning, and knowledge retention.
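The mapping from numerical actions to language could look like the sketch below. The function name, the action fields (current/target speed, a lane-change flag), and the template wording are illustrative assumptions, not the paper's actual implementation.

```python
def linguify_action(current_kmh: float, target_kmh: float, lane_change: int) -> str:
    """Render a low-level driving action as a natural-language command.

    lane_change: -1 = change left, 0 = keep lane, +1 = change right
    (an assumed encoding for this sketch).
    """
    if target_kmh > current_kmh:
        speed_part = f"Accelerate to {target_kmh:g} km/h"
    elif target_kmh < current_kmh:
        speed_part = f"Decelerate to {target_kmh:g} km/h"
    else:
        speed_part = f"Maintain speed at {target_kmh:g} km/h"
    lane_part = {0: "maintain current lane",
                 -1: "change to the left lane",
                 1: "change to the right lane"}[lane_change]
    return f"{speed_part}, {lane_part}."

print(linguify_action(50, 30, 0))  # Decelerate to 30 km/h, maintain current lane.
```

Because the output is ordinary text, it can be used directly as a supervision target for the VLM without introducing new action tokens.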

System Architecture

  1. VLM Backbone: an open-source VLM such as Gemma-3 or LLaVA, fine-tuned only through LoRA adapters while the original parameters stay frozen.
  2. Action Linguification Module: converts numerical actions into natural language, bridging driving data and the VLM.
  3. Lightweight Action Decoder: converts natural language back into control commands such as waypoints/trajectories; trained independently so it does not affect the VLM backbone.
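The decoder side of the pipeline can be sketched as a small parser from language back to numbers. The regex, the output fields, and the `decode_action` name are assumptions for illustration; the paper's decoder is a trained module, not a rule-based parser.

```python
import re

def decode_action(text: str) -> dict:
    """Parse a natural-language action into a numeric control command (toy sketch)."""
    m = re.search(
        r"(accelerate|decelerate|maintain speed) (?:to|at) (\d+(?:\.\d+)?) km/h",
        text, re.IGNORECASE,
    )
    if not m:
        raise ValueError(f"unparseable action: {text!r}")
    return {
        "target_speed_kmh": float(m.group(2)),
        "keep_lane": "maintain current lane" in text.lower(),
    }

cmd = decode_action("Decelerate to 30 km/h, maintain current lane.")
print(cmd)  # {'target_speed_kmh': 30.0, 'keep_lane': True}
```

Keeping this stage separate from the VLM is what allows the backbone's weights (and hence its general knowledge) to remain untouched by the control-specific training signal.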

Section 04

Experimental Design and Evaluation Framework

Dataset

The study uses nuScenes (multimodal driving scenes) and the Waymo Open Dataset (large-scale, high-quality driving data).

Evaluation Metrics

  • Driving Performance: L2 displacement error, collision rate, route completion rate.
  • General Ability Retention: MMMU (multimodal reasoning), MMStar (vision-language benchmark), VQA benchmarks, compared with the original VLM before fine-tuning.
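The first driving metric above, L2 displacement error, is the mean Euclidean distance between predicted and ground-truth trajectory waypoints. The sketch below assumes a simple average over the horizon; benchmark implementations differ in averaging convention (per-timestep vs. per-horizon).

```python
import math

def l2_displacement_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth waypoints."""
    assert len(pred) == len(gt), "trajectories must share a horizon"
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(dists) / len(dists)

pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.1)]
gt = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.0)]
print(round(l2_displacement_error(pred, gt), 4))  # 0.0667
```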

Ablation Experiments

Three configurations are designed to verify component contributions:

| Configuration | Action Format | Fine-tuning Method |
| --- | --- | --- |
| Baseline | Numerical labels | Full fine-tuning |
| Ablation 1 | Numerical labels | LoRA |
| Ablation 2 (this project) | Natural language | LoRA |

Section 05

Method Comparison and Advantages

| Method | Action Format | Fine-tuning Method | Catastrophic Forgetting |
| --- | --- | --- | --- |
| Standard VLA (EMMA, OpenDriveVLA) | Numerical labels | Full fine-tuning | Severe |
| Dual-system VLA (Senna) | Mixed format | Full fine-tuning | Moderate |
| This project | Natural language | LoRA only | Minimal |

By changing the action representation and fine-tuning strategy, this scheme achieves task adaptation while preserving the model's general capabilities.


Section 06

Research Significance and Outlook

Research Significance

  1. Autonomous Driving Field: offers a way to train VLA models without sacrificing general capabilities.
  2. Catastrophic Forgetting Research: frames forgetting as a data-distribution mismatch and provides a practical, representation-level remedy.
  3. Lightweight Fine-tuning: verifies the feasibility of LoRA in complex autonomous-driving scenarios.
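The LoRA mechanism whose feasibility the study relies on can be sketched in a few lines: the frozen weight W is augmented with a scaled low-rank residual (alpha / r) * B @ A, and only the small factors A and B are trained. The toy dimensions and pure-Python matrices below are illustrative, not the paper's setup.

```python
def matvec(M, v):
    """Plain matrix-vector product for small lists-of-lists."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

d_out, d_in, r, alpha = 3, 3, 1, 2
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen pretrained weight
A = [[0.1, 0.2, 0.3]]      # trainable (r x d_in), small random init in practice
B = [[0.0], [0.0], [0.0]]  # trainable (d_out x r), zero init

def lora_forward(x):
    # Frozen path plus scaled low-rank residual; with B = 0 the residual
    # vanishes, so fine-tuning starts exactly at the pretrained model.
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + (alpha / r) * l for b, l in zip(base, low_rank)]

x = [1.0, 2.0, 3.0]
print(lora_forward(x))  # [1.0, 2.0, 3.0] while B is zero-initialized
```

Because W never changes, whatever knowledge the pretrained VLM encodes in W is retained by construction; only the residual adapts to the driving task.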

Outlook

This study offers a broader insight for transfer learning in neural networks: re-encoding task data to match the pre-training distribution can alleviate catastrophic forgetting. As large models are applied to more vertical domains, this line of research will only grow in importance.