Multi-Expert Debate Framework: An Innovative Approach to Enabling Large Models to Think Like a Committee

The multi-model project proposes a multi-expert debate architecture that replaces the traditional chain of thought. By having three expert roles with different perspectives conduct internal debates before providing an answer, it significantly improves reasoning diversity and RLVR training effectiveness.

Tags: Large-model reasoning · Multi-expert systems · Qwen3 · LoRA · RLVR · Mathematical reasoning · Diversity
Published 2026-04-27 06:40 · Last activity 2026-04-27 07:22 · Estimated read: 6 min
Section 01

Multi-Expert Debate Framework: An Innovative Approach to Enabling Large Models to Think Like a Committee

The multi-model project proposes a multi-expert debate architecture that replaces the traditional chain of thought. By having three expert roles with different perspectives conduct internal debates before providing an answer, it significantly improves reasoning diversity and RLVR training effectiveness. This architecture is fine-tuned based on the Qwen3 model, with controllable training costs, providing a new direction for exploring the reasoning mechanisms of large models.

Section 02

Background: Limitations of Single Chain of Thought and the Emergence of New Ideas

Current mainstream reasoning-enhancement techniques for large models rely on chain of thought (e.g., Qwen3's thinking mode), but this is essentially single-model linear reasoning, prone to fixed thought patterns and path dependence. The multi-expert debate architecture simulates the discussion process of a human committee, aiming to break the single-perspective limitation and generate more diverse reasoning paths.


Section 03

Methodology: Core Mechanisms and Quantification Methods of the Multi-Expert Debate Framework

Architecture Design

Based on the Qwen3-30B-A3B-Base model, fine-tuned with LoRA rank-32, replacing the standard thinking block with a multi-expert debate block: the model sequentially plays three expert roles, analyzes the problem from their respective perspectives, supplements and questions each other, then synthesizes the opinions. The computational overhead is 4-5 orders of magnitude lower than Qwen's official thinking baseline, making it a lightweight exploration.
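The sequential role-playing described above can be sketched as a single prompt template. This is a minimal illustration, not the project's actual prompt: the three role names, the tag syntax, and the instruction wording are all assumptions.

```python
# Sketch of a multi-expert debate block built as one prompt.
# NOTE: the expert role names and template wording below are illustrative
# guesses, not the prompts actually used by the multi-model project.

EXPERTS = ["Algebraist", "Geometer", "Skeptical Reviewer"]

def build_debate_prompt(problem: str) -> str:
    """Build a prompt in which one model sequentially plays three expert
    roles, each later expert responding to the earlier ones, followed by
    a synthesis step that merges their opinions into a final answer."""
    turns = []
    for i, expert in enumerate(EXPERTS):
        if i == 0:
            instruction = f"As the {expert}, analyze the problem from your perspective."
        else:
            instruction = (f"As the {expert}, respond to the previous experts: "
                           "supplement their reasoning and question any weak steps.")
        turns.append(f'<expert role="{expert}">{instruction}</expert>')
    body = "\n".join(turns)
    return (f"Problem: {problem}\n<debate>\n{body}\n"
            "<synthesis>Weigh the experts' points and state the agreed answer."
            "</synthesis>\n</debate>")
```

The key structural choice is that the debate replaces the standard `<think>` block, so a single forward pass still produces one contiguous reasoning trace.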

Diversity Quantification

Using the all-mpnet-base-v2 model to encode the first 2000 characters of reasoning into vectors, and calculating pairwise cosine distances:

  • MATH-500: multi-expert debate trajectories are on average 78.2% farther apart than Qwen3-thinking's
  • AIME 24+25: the gap is 75.6%

This confirms that the framework generates more semantically dispersed reasoning paths.
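The dispersion metric itself is simple: embed each truncated trace and average the cosine distance over all pairs. The sketch below implements the pairwise computation in plain NumPy; in the project's setup each row would be the all-mpnet-base-v2 embedding of the first 2000 characters of one reasoning trace (e.g. via the `sentence-transformers` library), which is assumed here rather than shown.

```python
import numpy as np

def mean_pairwise_cosine_distance(embeddings: np.ndarray) -> float:
    """Mean cosine distance (1 - cosine similarity) over all unordered
    pairs of rows. Rows are assumed to be sentence embeddings of the
    first 2000 characters of each reasoning trace."""
    # Normalize rows so that dot products become cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    # Average over the strictly upper triangle (each pair counted once).
    iu = np.triu_indices(len(embeddings), k=1)
    return float(np.mean(1.0 - sims[iu]))
```

With real traces one would first compute `embeddings = model.encode([t[:2000] for t in traces])` using an all-mpnet-base-v2 `SentenceTransformer`, then compare the resulting means across the two systems.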

Section 04

Evidence: Benchmark Testing and RLVR Training Effect Verification

Benchmark Performance

  • MATH-500 L5 average hit rate: multi-expert 0.58 vs Qwen3-thinking 0.75
  • MATH-500 L5 pass@4: multi-expert 0.90 vs Qwen3-thinking 1.00
  • AIME 24+25 pass@1: multi-expert 0.23 vs Qwen3-thinking 0.73
  • AIME 24+25 pass@16: multi-expert 0.55 vs Qwen3-thinking 0.75

Single-sample metrics lag behind, but the gap narrows as the number of samples grows: on AIME the deficit shrinks from 50 percentage points at k=1 to 20 percentage points at k=16.
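For reference, pass@k figures like those above are conventionally computed with the unbiased estimator from the HumanEval/Codex methodology; whether this project uses exactly that estimator is an assumption, but it is the standard way to turn n sampled generations with c correct into a pass@k value.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples drawn without replacement from n generations, of which c
    are correct, solves the problem (1 - C(n-c, k) / C(n, k))."""
    if n - c < k:
        return 1.0  # too few incorrect samples: every size-k subset hits
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n=4 samples of which c=2 are correct, pass@2 is 1 - C(2,2)/C(4,2) = 5/6.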

RLVR Training Value

In a set of 877 Olympiad questions, the multi-expert framework surfaces 1.83 times as many variance-band problems as Qwen3-thinking (382 vs 209). After 100 steps of LoRA RL training on these problems, accuracy on the shared held-out set rose from 14% to 29%, at a fraction of the official baseline's training cost.
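Why "variance-band" problems matter for RLVR: problems the model always solves or never solves produce constant rewards and hence zero gradient signal, so only problems with an intermediate pass rate drive learning. The filter below sketches that selection; the exact band thresholds used by the project are an assumption here.

```python
def variance_band(problems: dict[str, list[bool]],
                  low: float = 0.0, high: float = 1.0) -> list[str]:
    """Keep problems whose empirical pass rate over sampled attempts lies
    strictly inside (low, high): always-solved and never-solved problems
    yield zero reward variance and therefore no RLVR gradient. The open
    interval (0, 1) is an assumed default, not the project's exact cutoff."""
    selected = []
    for name, attempts in problems.items():
        rate = sum(attempts) / len(attempts)
        if low < rate < high:
            selected.append(name)
    return selected
```

A framework that produces more diverse reasoning paths lands more problems inside this band, which is the mechanism behind the 382-vs-209 count.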


Section 05

Training Process: Reproduction Path and Resource Description

The project provides a complete reproduction path, divided into two phases:

  1. GSM8K warm-up: 80 RL steps, approximately 2 hours
  2. MATH continuation training: 128 steps, approximately 6 hours

Afterwards, evaluate on MATH-500 and AIME and run the diversity-analysis script.
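The two-phase schedule above is easy to encode as data; the sketch below is purely illustrative (the phase names and hyperparameter keys are invented for this example, only the datasets and step counts come from the text).

```python
# Illustrative encoding of the two-phase reproduction schedule.
# Phase names and dict keys are assumptions; datasets and step counts
# match the described reproduction path.
PHASES = [
    {"name": "gsm8k_warmup", "dataset": "GSM8K", "rl_steps": 80},
    {"name": "math_continue", "dataset": "MATH", "rl_steps": 128},
]

def total_steps(phases: list[dict]) -> int:
    """Sum the RL steps across all training phases."""
    return sum(p["rl_steps"] for p in phases)
```

Running both phases back to back gives 208 RL steps in roughly 8 hours total.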

Training uses the Tinker platform, passing checkpoint URIs via environment variables to avoid hard-coding account session IDs in the codebase.
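Reading the checkpoint URI from the environment keeps credentials out of version control; a minimal sketch follows, where the variable name `TINKER_CHECKPOINT_URI` is an assumption rather than the project's documented name.

```python
import os

def checkpoint_uri(var: str = "TINKER_CHECKPOINT_URI") -> str:
    """Read the checkpoint URI from an environment variable instead of
    hard-coding account session IDs in the codebase. The variable name
    used here is an assumed placeholder, not the project's actual one."""
    uri = os.environ.get(var)
    if not uri:
        raise RuntimeError(f"Set {var} to the checkpoint URI before training.")
    return uri
```

A launcher script would then export the variable once per session and every training entry point picks it up from there.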


Section 06

Limitations and Future: Current Shortcomings and Follow-up Research Directions

The project's limitations include:

  • Embedding-based diversity metrics may not fully capture differences in reasoning quality
  • The AIME test set has a small sample size
  • No baseline comparison with a matched token budget
  • No matched RL training for the thinking mode

These open issues point the way for future research.

Section 07

Conclusion: Value and Insights of the Multi-Expert Debate Framework

The multi-model project significantly improves reasoning diversity and RL training effectiveness at low training costs by changing the reasoning structure (from single linear thinking to multi-role debate). This idea has important implications for exploring the reasoning mechanisms of large models and improving the ability to solve complex tasks.