# TARS: Bridging the Reasoning Gap of Speech Large Models with Reinforcement Learning

> TARS closes the reasoning gap between speech large models and far stronger text models through asymmetric reward design and trajectory alignment, achieving the best performance among 7B-scale models on benchmarks such as MMSU and OBQA.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-17T14:11:49.000Z
- Last activity: 2026-04-17T14:18:38.469Z
- Popularity: 148.9
- Keywords: Speech LLM, Reinforcement Learning, Multimodal Reasoning, GRPO, Representation Alignment, ACL 2026
- Thread URL: https://www.zingnex.cn/en/forum/thread/tars
- Canonical: https://www.zingnex.cn/forum/thread/tars
- Markdown source: floors_fallback

---

## Introduction

Speech Large Language Models (Speech LLMs) perform far worse than text models on complex reasoning tasks, a deficit known as the 'modal reasoning gap'. TARS (Trajectory Alignment for Reasoning in Speech), proposed by the Amphion team at ACL 2026, closes this gap through asymmetric reward design and trajectory alignment, achieving the best performance among 7B-scale models on benchmarks such as MMSU and OBQA.

## Root Causes: Representation Drift and Behavioral Bias

The insufficient reasoning ability of speech large models stems mainly from two internal mechanisms:

1. **Representation Drift**: in the multi-layer Transformer stack, the hidden states of the speech modality drift further from the corresponding text-modality representations as depth increases, making it difficult to reuse text reasoning patterns.
2. **Behavioral Bias**: during long-chain reasoning, responses generated under speech conditions are semantically inconsistent with the reference text responses, so reasoning paths diverge and answer quality declines.
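Representation drift can be quantified directly from the two modalities' hidden states. The sketch below is a minimal illustration, not the paper's implementation: it assumes mean-pooling over the sequence and plain cosine similarity per layer, both of which are assumptions here.

```python
import numpy as np

def layerwise_drift(speech_hidden, text_hidden):
    """Per-layer cosine similarity between speech and text hidden states.

    speech_hidden, text_hidden: lists of (seq_len, dim) arrays, one per
    Transformer layer. We mean-pool over the sequence before comparing,
    a common simplification (the paper's exact pooling is not specified
    in this summary). Drift shows up as similarity decaying with depth.
    """
    sims = []
    for s, t in zip(speech_hidden, text_hidden):
        s_vec = s.mean(axis=0)
        t_vec = t.mean(axis=0)
        sim = float(s_vec @ t_vec /
                    (np.linalg.norm(s_vec) * np.linalg.norm(t_vec) + 1e-8))
        sims.append(sim)
    return sims
```

Plotting the returned list against layer index makes the drift described above visible: aligned models stay near 1.0 across depth, drifting ones decay toward 0.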

## Core Method: Asymmetric Trajectory Alignment

The core innovation of TARS is its **asymmetric reward design**: the text modality serves as a dynamic reference frame, and the speech modality co-evolves with the optimized text reasoning trajectory. It adds two dense reward signals:

1. **Representation Alignment**: compute the cosine similarity between speech- and text-trajectory hidden states at each Transformer layer to minimize representation drift.
2. **Behavioral Alignment**: use Qwen3-Embedding-0.6B to score the semantic consistency between the generated output and the reference text, steering the speech model's reasoning behavior toward the text's.
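The two dense signals can be folded into a single scalar reward per rollout. The sketch below is an assumed combination: the weights `w_repr`/`w_behav` and the additive form are illustrative choices, and the embedding vectors stand in for Qwen3-Embedding-0.6B outputs. The asymmetry is that only the speech trajectory is scored against the text reference, never the reverse.

```python
import numpy as np

def asymmetric_reward(task_correct, layer_sims, speech_emb, text_emb,
                      w_repr=0.1, w_behav=0.1):
    """Combine the sparse task reward with two dense alignment rewards.

    task_correct: whether the speech rollout answered correctly (sparse).
    layer_sims:   per-layer cosine similarities (representation alignment).
    speech_emb, text_emb: sentence embeddings of the speech output and the
    reference text answer (behavioral alignment). Weights are assumptions.
    """
    r_task = 1.0 if task_correct else 0.0
    r_repr = float(np.mean(layer_sims))  # dense: representation alignment
    r_behav = float(speech_emb @ text_emb /
                    (np.linalg.norm(speech_emb) * np.linalg.norm(text_emb) + 1e-8))
    return r_task + w_repr * r_repr + w_behav * r_behav
```

Keeping the alignment weights small preserves the task reward as the dominant signal while the dense terms shape the trajectory between sparse successes.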

## Technical Implementation: GRPO Training Framework

TARS uses **Group Relative Policy Optimization (GRPO)** as its core training algorithm, which learns from sparse rewards and self-explores better reasoning strategies. The project is built on the ms-swift framework, supports distributed training, and proceeds in three stages: data construction, preference-pair generation, and reinforcement learning. The team has open-sourced the complete MMLU training dataset (including synthetic audio) to facilitate community reproduction.

## Experimental Results: Best Performance Among 7B-Scale Models

On reasoning benchmarks such as MMSU and OBQA, TARS delivers strong results: speech reasoning accuracy improves substantially over baseline models; it reaches the best level among 7B-scale Speech LLMs; and the text modality's original capabilities are preserved without degradation. This validates the asymmetric alignment strategy: speech need not imitate text exactly, and can instead be co-optimized alongside the text reasoning trajectory.

## Open-Source Ecosystem: Model Weights and Resource Release

The TARS team has open-sourced the complete model weights based on Qwen2.5-Omni-7B (HuggingFace: yuantuo666/TARS-Qwen2.5-Omni-7B). The code repository includes training scripts, evaluation tools, and inference examples, and supports mainstream architectures such as Phi-4-Multimodal. Reproduction requires at least one A100 (80 GB) GPU for inference and eight A100s for distributed training; the project provides environment-configuration and dataset-construction guidelines.

## Insights and Outlook: A New Path for Multimodal Intelligence

The success of TARS shows that the modal gap can be bridged with the right alignment strategy. The asymmetric reward design breaks the traditional 'text teacher, speech student' paradigm and opens a co-evolution path. This idea is expected to extend to further modality combinations such as vision-speech and video-audio, advancing unified multimodal intelligence and providing a technical foundation for end-to-end speech interaction.
