TARS: Bridging the Reasoning Gap of Speech Large Models with Reinforcement Learning

TARS effectively addresses the problem that speech large models are far weaker than text models in reasoning tasks through asymmetric reward design and trajectory alignment technology, achieving the best performance among 7B-scale models on benchmarks like MMSU and OBQA.

Tags: Speech LLM · Reinforcement Learning · Multimodal Reasoning · GRPO · Representation Alignment · ACL 2026
Published 2026-04-17 22:11 · Recent activity 2026-04-17 22:18 · Estimated read: 7 min

Section 01

TARS: Bridging the Reasoning Gap of Speech Large Models with Reinforcement Learning (Introduction)

Speech Large Language Models (Speech LLMs) lag far behind text models on complex reasoning tasks, a deficit known as the "modal reasoning gap". TARS (Trajectory Alignment for Reasoning in Speech), proposed by the Amphion team at ACL 2026, closes this gap through asymmetric reward design and trajectory alignment, achieving the best performance among 7B-scale models on benchmarks such as MMSU and OBQA.


Section 02

Root Causes: Representation Drift and Behavioral Bias

Two internal mechanisms account for the weak reasoning ability of speech large models:

1. Representation drift: in the multi-layer Transformer stack, the hidden states of the speech modality drift away from the corresponding text representations as depth increases, making it difficult to reuse text reasoning patterns.
2. Behavioral bias: during long-chain reasoning, responses generated under speech conditions become semantically inconsistent with the reference text responses, so reasoning paths diverge and answer quality declines.
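Representation drift can be made concrete with a toy probe. The sketch below assumes we already have one hidden-state vector per Transformer layer for matched speech and text inputs; the `cosine` helper and the example vectors are illustrative, not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def layerwise_drift(speech_states, text_states):
    """Per-layer cosine similarity between speech and text hidden states.

    Each argument is a list of layer vectors (one per Transformer layer).
    Falling similarity at deeper layers indicates representation drift.
    """
    return [cosine(s, t) for s, t in zip(speech_states, text_states)]

# Toy example: speech states progressively rotate away from the text states,
# so similarity decreases with depth.
text_layers = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
speech_layers = [[1.0, 0.1], [1.0, 0.5], [1.0, 1.0]]
sims = layerwise_drift(speech_layers, text_layers)
```

In a real diagnosis the layer vectors would come from the model's `output_hidden_states`, but the monotone drop in `sims` captures the phenomenon the paper describes.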


Section 03

Core Method: Asymmetric Trajectory Alignment

The core innovation of TARS is its asymmetric reward design: the text modality serves as a dynamic reference frame, and the speech modality co-evolves with the optimized text reasoning trajectory. Two dense reward signals drive this alignment:

1. Representation alignment: compute the cosine similarity between the hidden states of each Transformer layer in the speech and text trajectories, minimizing representation drift.
2. Behavioral alignment: use Qwen3-Embedding-0.6B to score the semantic consistency between the generated output and the reference text, steering the speech model's reasoning behavior toward the text.
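The two dense rewards can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding step (Qwen3-Embedding-0.6B in the paper) is abstracted to pre-computed vectors, and the 0.5/0.5 weights are assumed for the example only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def representation_reward(speech_layers, text_layers):
    """Dense reward 1: mean per-layer cosine similarity between the speech
    trajectory's hidden states and the text reference states."""
    sims = [cosine(s, t) for s, t in zip(speech_layers, text_layers)]
    return sum(sims) / len(sims)

def behavioral_reward(gen_embedding, ref_embedding):
    """Dense reward 2: semantic consistency between the generated answer and
    the reference text answer. The paper scores this with Qwen3-Embedding-0.6B;
    here the embeddings are passed in directly."""
    return cosine(gen_embedding, ref_embedding)

def alignment_reward(speech_layers, text_layers, gen_emb, ref_emb,
                     w_rep=0.5, w_beh=0.5):
    """Asymmetric total reward: only the speech policy receives these
    alignment terms, while the text trajectory acts as the reference frame.
    The weights w_rep/w_beh are illustrative, not from the paper."""
    return (w_rep * representation_reward(speech_layers, text_layers)
            + w_beh * behavioral_reward(gen_emb, ref_emb))

# Toy trajectories: speech states close to the text states, answers well aligned.
reward = alignment_reward(
    speech_layers=[[1.0, 0.1], [0.9, 0.2]],
    text_layers=[[1.0, 0.0], [1.0, 0.0]],
    gen_emb=[0.6, 0.8],
    ref_emb=[0.8, 0.6],
)
```

The asymmetry lives in what gets optimized: the speech policy is pushed toward the text trajectory, while the text side is free to keep improving as a moving reference rather than a frozen teacher.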


Section 04

Technical Implementation: GRPO Training Framework

TARS adopts Group Relative Policy Optimization (GRPO) as its core training algorithm, which can learn from sparse rewards and explore better reasoning strategies on its own. The project is built on the ms-swift framework, supports distributed training, and follows a three-stage pipeline: data construction, preference-pair generation, and reinforcement learning. The team has open-sourced the complete MMLU training dataset (including synthetic audio) for community reproduction.
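At the heart of GRPO is a critic-free advantage estimate: each prompt is sampled several times, and every trajectory's reward is normalized against its own group. A minimal sketch of that normalization, with toy reward values:

```python
import math

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages as used by GRPO.

    Each sampled trajectory's reward is standardized against the mean and
    standard deviation of its group, replacing a learned value critic.
    """
    mu = sum(group_rewards) / len(group_rewards)
    var = sum((r - mu) ** 2 for r in group_rewards) / len(group_rewards)
    std = math.sqrt(var)
    return [(r - mu) / (std + eps) for r in group_rewards]

# Four rollouts of the same prompt: one good, one bad, two average.
advs = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```

The resulting advantages then weight the policy-gradient update; trajectories that beat their group mean are reinforced, the rest are suppressed.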


Section 05

Experimental Results: Best Performance Among 7B-Scale Models

On reasoning benchmarks such as MMSU (Multimodal Multiple-Choice Understanding) and OBQA (Open-domain Question Answering), TARS performs strongly: speech reasoning accuracy improves substantially over baseline models; it reaches the best level among 7B-scale Speech LLMs; and the text modality retains its original capabilities with no performance degradation. This confirms that the asymmetric alignment strategy works: speech does not need to imitate text exactly and can instead be co-optimized along the text reasoning trajectory.


Section 06

Open-Source Ecosystem: Model Weights and Resource Release

The TARS team has open-sourced the complete model weights based on Qwen2.5-Omni-7B (HuggingFace: yuantuo666/TARS-Qwen2.5-Omni-7B). The code repository includes training scripts, evaluation tools, and inference examples, and supports mainstream architectures such as Phi-4-Multimodal. Reproduction requires at least one A100 (80GB) for inference and eight A100s for distributed training; the project provides environment-configuration and dataset-construction guides.


Section 07

Insights and Outlook: A New Path for Multimodal Intelligence

The success of TARS shows that the modal gap can be bridged with the right alignment strategy. The asymmetric reward design breaks the traditional "text teacher, speech student" paradigm and opens a co-evolution path. Looking ahead, this idea could extend to more modality combinations such as vision-speech and video-audio, advancing unified multimodal intelligence and providing technical groundwork for end-to-end speech interaction.