Zing Forum

Disagreement-Guided Strategy Routing: Enabling Large Models to Vote When Needed and Rewrite When Necessary

Large reasoning models perform unstably on mathematical tasks. A new framework selects a test-time scaling strategy per instance based on how much the model's sampled outputs disagree: lightweight processing when samples agree, majority voting under moderate disagreement, and problem rewriting under high disagreement. It reports a 3-7% accuracy improvement while reducing sampling costs.

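Tags: test-time scaling, large model reasoning, mathematical reasoning, strategy routing, majority voting, problem rewriting
Published 2026-04-29 21:11 | Recent activity 2026-04-30 10:35 | Estimated read 7 min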

Section 01

[Introduction] Disagreement-Guided Strategy Routing: Making Large Model Reasoning Smarter and More Efficient

Large reasoning models exhibit unstable performance on mathematical tasks. Existing test-time scaling strategies incur high computational overhead and apply the same procedure to every instance regardless of difficulty. This study proposes a disagreement-guided strategy routing framework that selects a processing strategy per instance based on how much the sampled outputs disagree: lightweight processing for low disagreement, majority voting for moderate disagreement, and problem rewriting for high disagreement. The framework achieves a 3-7% accuracy improvement while reducing sampling costs, and it can be integrated into existing reasoning pipelines without additional training.

Section 02

Background: Test-Time Dilemma of Large Model Reasoning

Large Reasoning Models (LRMs) excel at complex tasks such as mathematical reasoning and code generation, but their performance becomes unstable on difficult instances. To improve reliability, researchers have developed test-time scaling strategies such as repeated sampling, self-correction, and tree search. These can boost accuracy, but they incur significant computational overhead, and the marginal gain on difficult problems diminishes. The core issue is that existing methods apply the same strategy to all instances without adapting to their difficulty.

Section 03

Core Insight: Disagreement Degree is a Key Signal for Difficulty and Correctness

The study found that the disagreement degree of model outputs is strongly correlated with instance difficulty and prediction correctness:

  • Low disagreement (model is confident): Multiple sampling outputs are highly consistent;
  • Moderate disagreement: Sampling results have obvious differences, but the correct answer is mostly in the majority;
  • High disagreement: Results vary greatly, and even the majority answer may not be correct.

Disagreement degree can therefore serve as a nearly free indicator of instance difficulty: it can be estimated from a small number of samples, with no computation beyond those samples.
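
As a rough illustration of how such a signal might be computed, the sketch below scores disagreement as the fraction of sampled answers that differ from the most common one. This particular formula, the function name `disagreement_degree`, and the whitespace and case normalization are assumptions made for illustration; the article does not specify the exact measure used in the study.

```python
from collections import Counter


def disagreement_degree(answers):
    """Fraction of samples that differ from the most common answer.

    0.0 means all samples agree; values near 1.0 mean almost no agreement.
    Assumption: answers are compared by exact match after light normalization.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Normalize so that " X " and "x" count as the same answer.
    normalized = [a.strip().lower() for a in answers]
    _, top_count = Counter(normalized).most_common(1)[0]
    return 1.0 - top_count / len(normalized)


if __name__ == "__main__":
    print(disagreement_degree(["42", "42", "42"]))              # unanimous
    print(disagreement_degree(["42", "42", "41", "42", "40"]))  # clear majority
    print(disagreement_degree(["1", "2", "3", "4"]))            # no agreement
```

Exact matching suits tasks with a short final answer, such as a number. For free-form outputs, a semantic similarity measure would be a more natural choice.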

Section 04

Strategy Routing Framework: Dynamically Selecting Optimal Computational Strategies

The framework dynamically selects strategies based on disagreement degree:

  1. Lightweight Parsing: For low disagreement, take the first or first few sampling results with almost no additional cost;
  2. Majority Voting: For moderate disagreement, generate multiple samples and select the most common answer to filter out occasional errors;
  3. Rewrite and Reconstruction: For high disagreement, change how the problem is presented (for example, by restating it or decomposing it into subproblems) to give the model new reasoning entry points.
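
The three-way routing above can be sketched as a simple threshold rule. The threshold values `low=0.2` and `high=0.6`, the `Strategy` enum, and the function name `route` are hypothetical choices for illustration, not values reported by the study.

```python
from enum import Enum


class Strategy(Enum):
    LIGHTWEIGHT = "lightweight parsing"
    MAJORITY_VOTE = "majority voting"
    REWRITE = "rewrite and reconstruction"


def route(disagreement, low=0.2, high=0.6):
    """Map a disagreement score in [0, 1] to one of the three strategies.

    Assumption: `low` and `high` are illustrative thresholds, not the paper's.
    """
    if not 0.0 <= disagreement <= 1.0:
        raise ValueError("disagreement must lie in [0, 1]")
    if disagreement <= low:
        return Strategy.LIGHTWEIGHT    # samples agree: keep the cheap answer
    if disagreement < high:
        return Strategy.MAJORITY_VOTE  # some spread: sample more and vote
    return Strategy.REWRITE            # little agreement: restate the problem


if __name__ == "__main__":
    for score in (0.0, 0.4, 0.75):
        print(score, route(score).value)
```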

Section 05

Implementation Advantages: Training-Free and Modular Design

The framework is training-free: it requires no additional model training or fine-tuning and can be integrated into existing LRM reasoning pipelines. The implementation process is:

  1. Initial sampling (3-5 times) to obtain candidate outputs;
  2. Calculate disagreement degree (string matching, semantic similarity, etc.);
  3. Route strategies according to thresholds;
  4. Output results.

The modular design supports adjusting parameters such as thresholds, the number of samples, and rewriting strategies.
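
The four steps above can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `sample_fn` and `rewrite_fn` are hypothetical callables standing in for the model call and the rewriting step, the defaults (`n_init=4`, `n_vote=8`, `low=0.2`, `high=0.6`) are illustrative, and voting over fresh samples after a rewrite is an assumed way of picking the final answer.

```python
from collections import Counter


def majority(answers):
    """Most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def disagreement(answers):
    """Share of samples that differ from the most common answer."""
    return 1.0 - Counter(answers).most_common(1)[0][1] / len(answers)


def solve(problem, sample_fn, rewrite_fn,
          n_init=4, n_vote=8, low=0.2, high=0.6):
    """Return (answer, strategy_name) for one problem.

    sample_fn(problem) -> answer string   (hypothetical model call)
    rewrite_fn(problem) -> restated text  (hypothetical rewriting step)
    """
    # Step 1: a few initial samples.
    initial = [sample_fn(problem) for _ in range(n_init)]
    # Step 2: disagreement via exact string matching.
    d = disagreement(initial)
    # Steps 3 and 4: route by threshold and produce the answer.
    if d <= low:
        return initial[0], "lightweight"
    if d < high:
        extra = [sample_fn(problem) for _ in range(n_vote)]
        return majority(initial + extra), "majority_vote"
    rewritten = rewrite_fn(problem)
    fresh = [sample_fn(rewritten) for _ in range(n_vote)]
    return majority(fresh), "rewrite"


if __name__ == "__main__":
    # Deterministic stand-in for a model that always answers "7".
    print(solve("What is 3 + 4?", lambda p: "7", lambda p: p))
```

Because the model call and the rewriter are passed in as functions, each piece can be swapped without touching the routing logic, which is the kind of modularity the section describes.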

Section 06

Experimental Validation: Accuracy and Efficiency Improvements on Mathematical Benchmarks

The framework was validated on seven mathematical reasoning benchmarks (including arithmetic, algebra, etc.) and three LRMs:

  • Average accuracy improvement of 3%-7%, statistically significant and consistent across models;
  • Reduced sampling costs, avoiding computational waste on simple problems and ineffective searches on difficult ones;
  • Strategy distribution varies by dataset: lightweight parsing accounts for a high proportion in simple datasets, while majority voting is dominant in competition-level datasets.

Section 07

Technical Implications and Future Research Directions

The study brings three implications:

  1. Disagreement as a meta-signal can guide resource allocation and be extended to scenarios such as active learning and uncertainty quantification;
  2. Strategy diversity is important: future research can explore additional strategies such as external tool calling, multimodal reasoning, and human-machine collaboration;
  3. Rewriting strategies deserve attention—systematic and automated problem rewriting can improve reasoning effects.

Section 08

Limitations and Open Problems

The current framework has limitations:

  • Disagreement degree thresholds rely on heuristics/grid search, lacking theoretical guidance;
  • Rewriting strategies are based on templates/rules and need more intelligent methods;
  • Experiments are limited to mathematical tasks, and the effectiveness across domains (code generation, commonsense reasoning) remains to be verified.

Future work needs to optimize threshold selection, intelligent rewriting, and cross-domain adaptation.
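
To make the first limitation concrete, a grid search over the two disagreement thresholds might look roughly like the sketch below. Everything here is assumed for illustration: the function name `grid_search_thresholds`, the record format (a disagreement score plus, per strategy, whether that strategy answered correctly on a held-out instance), and the toy records, which are made up and are not results from the study. The sketch optimizes accuracy only and ignores sampling cost.

```python
from itertools import product


def grid_search_thresholds(records, grid):
    """Pick (low, high) maximizing accuracy on held-out records.

    records: list of (disagreement, outcomes), where outcomes maps a
             strategy name to True/False (was that strategy correct?).
    grid:    candidate threshold values.
    Returns (best_low, best_high, best_accuracy).
    """
    best = None
    for low, high in product(grid, repeat=2):
        if low >= high:
            continue  # thresholds must be ordered
        correct = 0
        for d, outcomes in records:
            if d <= low:
                strategy = "lightweight"
            elif d < high:
                strategy = "vote"
            else:
                strategy = "rewrite"
            correct += outcomes[strategy]
        accuracy = correct / len(records)
        if best is None or accuracy > best[2]:
            best = (low, high, accuracy)
    return best


if __name__ == "__main__":
    # Made-up toy records, for illustration only.
    toy = [
        (0.00, {"lightweight": True, "vote": True, "rewrite": True}),
        (0.25, {"lightweight": False, "vote": True, "rewrite": True}),
        (0.50, {"lightweight": False, "vote": True, "rewrite": False}),
        (0.75, {"lightweight": False, "vote": False, "rewrite": True}),
    ]
    print(grid_search_thresholds(toy, [0.1, 0.3, 0.6, 0.9]))
```

A search like this finds thresholds that fit one validation set, which is exactly why the lack of theoretical guidance is listed as a limitation: nothing guarantees the chosen values transfer to other models or datasets.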