Zing Forum

Inference-Time Computational Optimization for Reasoning Models: A Comparative Study of SFT and GRPO Fine-Tuning Strategies

This study systematically explores how different inference-time computational strategies (majority voting, Best-of-N, PRM-guided beam search, budget enforcement) affect reasoning accuracy under a fixed inference compute budget, and compares how models fine-tuned with SFT and with GRPO respond to each strategy.

Tags: test-time compute · reasoning optimization · SFT fine-tuning · GRPO · process reward model · beam search · majority voting · compute budget
Published 2026-04-19 02:45 · Recent activity 2026-04-19 02:51 · Estimated read: 8 min

Section 01

Introduction

This study examines how different inference-time computational strategies (majority voting, Best-of-N, PRM-guided beam search, budget enforcement) affect reasoning accuracy under a fixed inference compute budget, comparing models fine-tuned with SFT against those fine-tuned with GRPO. The core question: does the optimal inference-time strategy depend on the fine-tuning method? The study reveals an interaction effect between fine-tuning methods and inference-time strategies, offering guidance for the design of efficient reasoning systems.

Section 02

Research Background and Core Questions

In recent years, large language models have shown improved performance in reasoning tasks (mathematics, code, logic), but the inference cost has increased dramatically. How to maximize accuracy within a limited computational budget has become a key challenge for deployment.

Inference-time computational strategies improve accuracy at a low additional cost by generating and filtering multiple candidate answers during the inference phase.

Core questions: Which inference-time strategy achieves the highest accuracy under a fixed budget? Does the choice of optimal strategy depend on the fine-tuning method (SFT vs GRPO)?

Section 03

Overview of Inference-Time Computational Strategies

Four mainstream strategies are evaluated:

1. Majority Voting

A simple ensemble strategy: generate multiple independent answers and select the most frequent one. Advantages: easy to implement, no auxiliary model required. Disadvantage: performs poorly when the correct answer is not the plurality.
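
In code, majority voting over sampled final answers reduces to a frequency count. A minimal sketch (the sampled answers below are invented for illustration):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among sampled candidates."""
    counts = Counter(answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Hypothetical final answers extracted from N independent samples
samples = ["42", "41", "42", "42", "40"]
print(majority_vote(samples))  # prints 42
```

Note that ties are broken arbitrarily by `Counter`, which is one reason voting degrades when answers are highly diverse.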

2. Best-of-N with PRM

Generate N candidates and select the one the Process Reward Model (PRM) scores highest. Because the PRM evaluates the soundness of intermediate reasoning steps, this is more reliable than voting on complex tasks.
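
A minimal Best-of-N sketch; `prm_score` stands in for the learned Process Reward Model, which is replaced here by a toy heuristic of my own invention:

```python
def best_of_n(candidates, prm_score):
    """Select the candidate whose reasoning trace the PRM scores highest.

    prm_score(trace) -> float; in practice a learned Process Reward Model,
    here any callable that scores a full reasoning trace.
    """
    return max(candidates, key=prm_score)

# Toy stand-in scorer: reward traces that contain explicit check steps
toy_prm = lambda trace: trace.count("check")
traces = ["guess 7", "compute 6, check: 6", "compute 6, check: 6, check again"]
```

Calling `best_of_n(traces, toy_prm)` returns the trace with the most check steps; swapping in a real PRM changes only the scoring callable.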

3. PRM-Guided Beam Search

Maintain a beam of candidate partial solutions at each step, using the PRM to guide the search toward promising paths. It spends the budget more effectively than independent sampling but is more complex to implement.
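
The stepwise search can be sketched as follows; `expand` and `score` are placeholders for the model's step generator and the PRM, not APIs from the study:

```python
def prm_beam_search(expand, score, beam_width, n_steps, start=""):
    """Stepwise beam search guided by a process reward model (PRM).

    expand(prefix) -> list of candidate next reasoning steps
    score(prefix)  -> PRM score for a partial reasoning trace
    """
    beams = [start]
    for _ in range(n_steps):
        # Expand every beam, then keep only the highest-scoring prefixes
        candidates = [p + step for p in beams for step in expand(p)]
        beams = sorted(candidates, key=score, reverse=True)[:beam_width]
    return max(beams, key=score)
```

Unlike Best-of-N, pruning happens at every step, so the budget is concentrated on prefixes the PRM already considers promising.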

4. Budget Enforcement

Dynamically adjust the generation length/thinking depth to control computational consumption, balancing efficiency and quality.
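
The simplest form of budget enforcement is a hard token cap on the decoding loop. This is a generic sketch, not the study's specific mechanism; `next_token` is a stand-in for a model's decoding step:

```python
def generate_with_budget(next_token, max_tokens):
    """Decode until the model emits an end marker (None) or the
    token budget is exhausted, whichever comes first."""
    tokens = []
    while len(tokens) < max_tokens:
        tok = next_token(tokens)
        if tok is None:  # natural end of generation
            break
        tokens.append(tok)
    return tokens

# Toy "model" that would ramble for 100 tokens; the budget caps it at 8
chatty = lambda ctx: len(ctx) if len(ctx) < 100 else None
print(len(generate_with_budget(chatty, max_tokens=8)))  # prints 8
```

Dynamic variants adjust `max_tokens` per problem, e.g. granting harder problems a deeper "thinking" allowance.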

Section 04

Comparison of SFT and GRPO Fine-Tuning Paradigms

Supervised Fine-Tuning (SFT)

The mainstream approach: learn task patterns via supervised learning on high-quality annotated data. Advantages: stable training, fast convergence, and direct imitation of expert reasoning traces. Disadvantage: limited generalization to out-of-distribution problems.

GRPO Fine-Tuning

GRPO (Group Relative Policy Optimization) is a reinforcement-learning method that optimizes the policy to maximize reward. Rather than imitating fixed patterns, it explores diverse problem-solving strategies. Challenges: training instability and reward hacking.
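
The "group-relative" part of GRPO can be illustrated by its advantage computation: the rewards of a group of sampled completions for the same prompt are normalized by the group's mean and standard deviation. A sketch of the standard formulation, not code from the study:

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each sampled completion's
    reward against its group's mean and standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:  # all rewards equal: no learning signal in this group
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# Two correct (reward 1) and two incorrect (reward 0) completions
print(group_relative_advantages([1, 0, 1, 0]))  # [1.0, -1.0, 1.0, -1.0]
```

Because the baseline comes from the group itself, no separate value model is needed, which is the practical appeal of GRPO over PPO-style training.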

Section 05

Research Findings and Insights

Core finding: There is a significant interaction effect between fine-tuning methods and inference-time strategies.

  • For SFT models: majority voting yields a sizable accuracy gain, since SFT models produce consistent answer patterns.
  • For GRPO models: PRM-guided strategies work better, since their highly diverse answers require fine-grained filtering.

Impact of budget size: simple strategies are the most cost-effective at small budgets; complex search strategies extract more value from the extra compute at large budgets.
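
The qualitative findings above could be encoded as a toy decision rule; the 8-sample threshold is purely an illustrative assumption, not a number from the study:

```python
def choose_strategy(finetune_method, n_samples):
    """Toy rule of thumb reflecting the reported interaction effect.

    finetune_method: "SFT" or "GRPO"; n_samples: the sampling budget.
    The threshold of 8 samples is an invented illustrative value.
    """
    if n_samples < 8:
        return "majority_voting"   # simple strategies win at small budgets
    if finetune_method == "SFT":
        return "majority_voting"   # consistent answers favor plain voting
    return "prm_beam_search"       # diverse GRPO outputs need PRM filtering
```

The point is not the specific rule but that the strategy choice is conditioned on the training method, not made in isolation.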

Section 06

Practical Application Significance

It provides direct guidance for the deployment of large reasoning models:

  • Developers need to select inference strategies based on the model training method, rather than considering the strategy in isolation.
  • Resource-constrained scenarios: Find the strategy that achieves the maximum accuracy improvement with the minimum computational overhead.
  • Extreme performance scenarios: Understand the upper limits and boundaries of strategies to design efficient reasoning systems.

Section 07

Future Research Directions

  • Design adaptive inference-time strategies: dynamically adjust computational allocation based on problem difficulty.
  • Build hybrid reasoning frameworks: combine the advantages of multiple strategies.
  • Adapt to model capability improvements: evolve inference-time strategies to match new model characteristics.

Section 08

Conclusion

As the reasoning capabilities of large language models continue to grow, efficient use of computational resources is a key concern. By systematically comparing the combined effects of inference-time strategies and fine-tuning methods, this study provides empirical evidence and decision-making guidance for building efficient reasoning systems. We look forward to the emergence of smarter, more efficient reasoning paradigms.