TARS: Bridging the Reasoning Gap of Speech Large Models with Reinforcement Learning

TARS effectively addresses the problem that speech large models are far weaker than text models in reasoning tasks through asymmetric reward design and trajectory alignment technology, achieving the best performance among 7B-scale models on benchmarks like MMSU and OBQA.

Tags: Speech LLM · Reinforcement Learning · Multimodal Reasoning · GRPO · Representation Alignment · ACL 2026
Published 2026-04-17 22:11 · Recent activity 2026-04-17 22:18 · Estimated read: 7 min

Section 01

TARS: Bridging the Reasoning Gap of Speech Large Models with Reinforcement Learning (Introduction)

Speech Large Language Models (Speech LLMs) lag far behind text models on complex reasoning tasks, a deficit known as the "modal reasoning gap". TARS (Trajectory Alignment for Reasoning in Speech), proposed by the Amphion team at ACL 2026, closes this gap through asymmetric reward design and trajectory alignment, achieving the best performance among 7B-scale models on benchmarks such as MMSU and OBQA.


Section 02

Root Causes: Representation Drift and Behavioral Bias

Two internal mechanisms account for the weak reasoning ability of speech large models:

1. Representation drift: in the multi-layer Transformer stack, the hidden states of the speech modality drift away from the corresponding text representations as depth increases, making it difficult to reuse text reasoning patterns.
2. Behavioral bias: during long-chain reasoning, responses generated under speech conditions become semantically inconsistent with the reference text responses, so reasoning paths diverge and answer quality declines.
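Representation drift can be made concrete with a toy probe. The sketch below assumes we already have one hidden-state vector per Transformer layer for matched speech and text inputs; the `cosine` helper and the example vectors are illustrative, not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def layerwise_drift(speech_states, text_states):
    """Per-layer cosine similarity between speech and text hidden states.

    Each argument is a list of layer vectors (one per Transformer layer).
    Falling similarity at deeper layers indicates representation drift.
    """
    return [cosine(s, t) for s, t in zip(speech_states, text_states)]

# Toy example: speech states progressively rotate away from the text states,
# so similarity decreases with depth.
text_layers = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
speech_layers = [[1.0, 0.1], [1.0, 0.5], [1.0, 1.0]]
sims = layerwise_drift(speech_layers, text_layers)
```

In a real diagnosis the layer vectors would come from the model's `output_hidden_states`, but the monotone drop in `sims` captures the phenomenon the paper describes.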


Section 03

Core Method: Asymmetric Trajectory Alignment

The core innovation of TARS is its asymmetric reward design: the text modality serves as a dynamic reference frame, and the speech modality co-evolves with the optimized text reasoning trajectory. Two dense reward signals drive this alignment:

1. Representation alignment: compute the cosine similarity between the hidden states of each Transformer layer in the speech and text trajectories, minimizing representation drift.
2. Behavioral alignment: use Qwen3-Embedding-0.6B to score the semantic consistency between the generated output and the reference text, steering the speech model's reasoning behavior toward the text.
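The two dense rewards can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding step (Qwen3-Embedding-0.6B in the paper) is abstracted to pre-computed vectors, and the 0.5/0.5 weights are assumed for the example only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def representation_reward(speech_layers, text_layers):
    """Dense reward 1: mean per-layer cosine similarity between the speech
    trajectory's hidden states and the text reference states."""
    sims = [cosine(s, t) for s, t in zip(speech_layers, text_layers)]
    return sum(sims) / len(sims)

def behavioral_reward(gen_embedding, ref_embedding):
    """Dense reward 2: semantic consistency between the generated answer and
    the reference text answer. The paper scores this with Qwen3-Embedding-0.6B;
    here the embeddings are passed in directly."""
    return cosine(gen_embedding, ref_embedding)

def alignment_reward(speech_layers, text_layers, gen_emb, ref_emb,
                     w_rep=0.5, w_beh=0.5):
    """Asymmetric total reward: only the speech policy receives these
    alignment terms, while the text trajectory acts as the reference frame.
    The weights w_rep/w_beh are illustrative, not from the paper."""
    return (w_rep * representation_reward(speech_layers, text_layers)
            + w_beh * behavioral_reward(gen_emb, ref_emb))

# Toy trajectories: speech states close to the text states, answers well aligned.
reward = alignment_reward(
    speech_layers=[[1.0, 0.1], [0.9, 0.2]],
    text_layers=[[1.0, 0.0], [1.0, 0.0]],
    gen_emb=[0.6, 0.8],
    ref_emb=[0.8, 0.6],
)
```

The asymmetry lives in what gets optimized: the speech policy is pushed toward the text trajectory, while the text side is free to keep improving as a moving reference rather than a frozen teacher.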


Section 04

Technical Implementation: GRPO Training Framework

TARS adopts Group Relative Policy Optimization (GRPO) as its core training algorithm, which can learn from sparse rewards and explore better reasoning strategies on its own. The project is built on the ms-swift framework, supports distributed training, and follows a three-stage pipeline: data construction, preference-pair generation, and reinforcement learning. The team has open-sourced the complete MMLU training dataset (including synthetic audio) for community reproduction.
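At the heart of GRPO is a critic-free advantage estimate: each prompt is sampled several times, and every trajectory's reward is normalized against its own group. A minimal sketch of that normalization, with toy reward values:

```python
import math

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages as used by GRPO.

    Each sampled trajectory's reward is standardized against the mean and
    standard deviation of its group, replacing a learned value critic.
    """
    mu = sum(group_rewards) / len(group_rewards)
    var = sum((r - mu) ** 2 for r in group_rewards) / len(group_rewards)
    std = math.sqrt(var)
    return [(r - mu) / (std + eps) for r in group_rewards]

# Four rollouts of the same prompt: one good, one bad, two average.
advs = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```

The resulting advantages then weight the policy-gradient update; trajectories that beat their group mean are reinforced, the rest are suppressed.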


Section 05

Experimental Results: Best Performance Among 7B-Scale Models

On reasoning benchmarks such as MMSU (Multimodal Multiple-Choice Understanding) and OBQA (Open-domain Question Answering), TARS performs strongly: speech reasoning accuracy improves substantially over baseline models; it reaches the best level among 7B-scale Speech LLMs; and the text modality retains its original capabilities with no performance degradation. This confirms that the asymmetric alignment strategy works: speech does not need to imitate text exactly and can instead be co-optimized along the text reasoning trajectory.


Section 06

Open-Source Ecosystem: Model Weights and Resource Release

The TARS team has open-sourced the complete model weights based on Qwen2.5-Omni-7B (HuggingFace: yuantuo666/TARS-Qwen2.5-Omni-7B). The code repository includes training scripts, evaluation tools, and inference examples, and supports mainstream architectures such as Phi-4-Multimodal. Reproduction requires at least one A100 (80GB) for inference and eight A100s for distributed training; the project provides environment-configuration and dataset-construction guides.


Section 07

Insights and Outlook: A New Path for Multimodal Intelligence

The success of TARS shows that the modal gap can be bridged with the right alignment strategy. The asymmetric reward design breaks the traditional "text teacher, speech student" paradigm and opens a co-evolution path. Looking ahead, this idea could extend to more modality combinations such as vision-speech and video-audio, advancing unified multimodal intelligence and providing technical groundwork for end-to-end speech interaction.