Zing Forum


V2X-QA: A Multimodal Large Model Reasoning Dataset and Benchmark for V2X Cooperative Autonomous Driving

V2X-QA is a multi-view visual question answering dataset for autonomous driving, built from real-world scenarios. It supports controlled evaluation from three perspectives (vehicle-side, infrastructure-side, and cooperative), and the project also releases V2X-MoE, a MoE baseline model built on Qwen3-VL.

Tags: V2X · Autonomous Driving · Multimodal Large Model · Vehicle-Infrastructure Cooperation · Visual Question Answering · Dataset · Qwen3-VL · MoE
Published 2026-04-06 11:12 · Recent activity 2026-04-06 11:18 · Estimated read: 6 min

Section 01

[Introduction] Overview of V2X-QA Dataset and V2X-MoE Baseline Model

V2X-QA is a multimodal large model reasoning dataset and benchmark for V2X cooperative autonomous driving, built from real scenarios and supporting controlled evaluation from three perspectives: vehicle-side, infrastructure-side, and cooperative. The project also releases V2X-MoE, a MoE baseline model built on Qwen3-VL, providing a new evaluation dimension for applying multimodal large models to autonomous driving.


Section 02

Project Background and Core Positioning

Autonomous driving is shifting from single-vehicle intelligence to vehicle-infrastructure cooperation (V2X), and integrating vehicle-side close-range detail with infrastructure-side global perception is a key challenge. Traditional datasets mostly cover a single perspective. V2X-QA is built on V2X-Seq-SPD and, for the first time, integrates three perspectives—vehicle-side (VS), infrastructure-side (IS), and cooperative (CO)—into a unified VQA framework, enabling precise quantification of how much each information source contributes to model reasoning.


Section 03

Dataset Architecture and Task Design

V2X-QA includes 12 view-aligned tasks covering three levels: perception (recognizing traffic participants, signs, etc.), prediction (trajectory inference), and reasoning and planning (driving decision-making). Each task has evaluation subsets for all three perspectives, and annotations are stored in JSONL format (including questions, options, answers, and image paths). Due to license restrictions, the original images must be downloaded separately through the official V2X-Seq-SPD channel.
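Since the annotations are JSONL (one example per line), loading them is a one-pass parse. A minimal sketch, assuming field names such as `question`, `options`, `answer`, and `image_path` — the actual keys may differ from the project's released files:

```python
import json

def load_v2x_qa(path):
    """Load V2X-QA-style annotations from a JSONL file, one example per line.

    Each line is assumed to be a JSON object with keys like
    'question', 'options', 'answer', and 'image_path' (hypothetical names).
    """
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                examples.append(json.loads(line))
    return examples
```

Because images ship separately, the `image_path` values would be resolved against the locally downloaded V2X-Seq-SPD image root before evaluation.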


Section 04

V2X-MoE Baseline Model Design

V2X-MoE is a MoE model based on Qwen3-VL. It uses an explicit view-routing mechanism and contains three LoRA expert modules, one each for the vehicle-side, infrastructure-side, and cooperative perspectives. During inference, the expert corresponding to the question's perspective is activated, avoiding the performance degradation that occurs when a single model must adapt to multiple data distributions. Training proceeds in three stages—joint MCQA training → cooperative-view fine-tuning → infrastructure-side view enhancement—balancing general and specialized capabilities.
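The explicit view routing described above can be sketched as a simple lookup from the question's perspective tag to a LoRA adapter name. This is a hypothetical illustration — the adapter names and routing interface in the actual V2X-MoE code are assumptions:

```python
# Hypothetical mapping from perspective tag to LoRA expert adapter name.
# The real V2X-MoE adapter identifiers may differ.
EXPERTS = {
    "VS": "lora_vehicle",          # vehicle-side expert
    "IS": "lora_infrastructure",   # infrastructure-side expert
    "CO": "lora_cooperative",      # cooperative expert
}

def route_expert(view_tag):
    """Return the LoRA adapter to activate for a question's perspective tag."""
    try:
        return EXPERTS[view_tag]
    except KeyError:
        raise ValueError(f"unknown view tag: {view_tag!r}")
```

In a LoRA-based setup (e.g. with a library like PEFT), the returned name would be passed to an adapter-switching call before generation, so each question is answered by the expert trained on its distribution.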


Section 05

Technical Implementation and Reproduction Guide

The project provides complete training and evaluation scripts (supporting Conda/venv environments). Training scripts live in the model/train/ directory (one per stage), and the evaluation script v2x_moe_eval_mcqa_qwen3.py can load pre-trained checkpoints directly. The checkpoints include the three expert LoRA weights and their configurations, so users can reproduce results without training from scratch. Note: annotation files, scripts, and checkpoints are maintained by the project; the original images and base models must be obtained under the upstream agreements.
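At the core of any MCQA evaluation script is a simple exact-match accuracy over the predicted options. A minimal sketch of that metric — the actual internals of v2x_moe_eval_mcqa_qwen3.py are not shown here and may compute more than this:

```python
def mcqa_accuracy(predictions, answers):
    """Exact-match accuracy for multiple-choice QA.

    predictions / answers are parallel lists of option labels (e.g. "A".."D").
    Returns the fraction of predictions matching the gold answer.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have equal length")
    if not answers:
        return 0.0
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)
```

For the controlled three-perspective evaluation, this metric would be computed separately on the VS, IS, and CO subsets so the contribution of each information source can be compared.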


Section 06

Research Value and Application Prospects

V2X-QA fills a gap in the evaluation of multimodal large models for V2X cooperation. Compared with traditional datasets, it emphasizes higher-level reasoning (understanding scenarios and making decisions), which aligns with the direction multimodal large models are heading. In practice, it provides a standardized testing platform for iterating V2X cooperative algorithms, and can be used to evaluate vehicle-side perception optimization, roadside deployment strategies, cloud fusion algorithms, and more.


Section 07

Summary and Outlook

V2X-QA provides important infrastructure for V2X cooperative research through its multi-view VQA dataset and MoE baseline model. Its controlled evaluation design, modular architecture, and open-source implementation reflect a solid understanding of the domain's needs. For researchers and engineers it is both a benchmark tool and a reference framework, and we look forward to the project's continued iteration and its contribution to the field.