# Inference-Time Computational Optimization for Reasoning Models: A Comparative Study of SFT and GRPO Fine-Tuning Strategies

> This study systematically explores the impact of different inference-time computational strategies (majority voting, Best-of-N, PRM-guided beam search, budget enforcement) on reasoning accuracy under a fixed inference computational budget, and compares how two fine-tuning methods, SFT and GRPO, differ in which strategy serves them best.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-18T18:45:16.000Z
- Last activity: 2026-04-18T18:51:48.683Z
- Popularity: 159.9
- Keywords: test-time compute, inference optimization, SFT fine-tuning, GRPO, process reward model, beam search, majority voting, compute budget
- Page link: https://www.zingnex.cn/en/forum/thread/sftgrpo
- Canonical: https://www.zingnex.cn/forum/thread/sftgrpo
- Markdown source: floors_fallback

---

## Introduction

This study examines how different inference-time computational strategies (majority voting, Best-of-N, PRM-guided beam search, budget enforcement) affect reasoning accuracy under a fixed inference computational budget, and compares the two fine-tuning methods SFT and GRPO. The core question is: does the optimal inference-time strategy depend on the fine-tuning method? The study reveals an interaction effect between fine-tuning methods and inference-time strategies, offering guidance for the design of efficient reasoning systems.

## Research Background and Core Questions

In recent years, large language models have shown improved performance in reasoning tasks (mathematics, code, logic), but the inference cost has increased dramatically. How to maximize accuracy within a limited computational budget has become a key challenge for deployment.

Inference-time computational strategies improve accuracy at a low additional cost by generating and filtering multiple candidate answers during the inference phase.

Core questions: Which inference-time strategy achieves the highest accuracy under a fixed budget? Does the choice of optimal strategy depend on the fine-tuning method (SFT vs GRPO)?

## Overview of Inference-Time Computational Strategies

Four mainstream strategies are evaluated:

### 1. Majority Voting
A simple ensemble strategy: generate multiple independent answers and select the most frequent one. Advantages: easy to implement, requires no additional models. Disadvantages: performs poorly when the correct answer is not the most frequent one.
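The voting step itself is a few lines. A minimal sketch (the function name and the toy samples are illustrative, not from the study):

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most frequent final answer among sampled candidates."""
    return Counter(answers).most_common(1)[0][0]

# Example: five independently sampled answers to one math problem.
samples = ["42", "41", "42", "42", "7"]
print(majority_vote(samples))  # -> "42"
```

In practice the answers would come from independent sampling runs of the same model, with only the final answer (not the full trace) compared for equality.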

### 2. Best-of-N with PRM
Generate N candidates and select the one scored highest by a Process Reward Model (PRM). Because the PRM scores the soundness of intermediate reasoning steps rather than only the final answer, this selection is more reliable on complex tasks.
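Best-of-N reduces to an argmax over candidates once a scoring function is available. A minimal sketch, where `toy_prm_score` is a stand-in for a real process reward model (all names here are illustrative):

```python
def best_of_n(candidates, prm_score):
    """Return the candidate reasoning trace with the highest PRM score."""
    return max(candidates, key=prm_score)

# Toy stand-in for a PRM: score a trace by the fraction of steps it
# judges sound. A real PRM would score each step with a learned model
# and aggregate (e.g. by taking the minimum or mean step score).
def toy_prm_score(trace: list[str]) -> float:
    return sum(step == "ok" for step in trace) / len(trace)

candidates = [["ok", "bad"], ["ok", "ok"], ["bad", "bad"]]
print(best_of_n(candidates, toy_prm_score))
```

The choice of aggregation (min vs. mean over step scores) is itself a design decision; min is stricter, penalizing any single flawed step.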

### 3. PRM-Guided Beam Search
Maintain a beam of candidate partial solutions at each step, using the PRM to guide the search toward promising paths. This uses the budget more effectively than independent sampling, but is more complex to implement.
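The step-by-step pruning described above can be sketched as follows. Here `expand` and `prm_score` are stand-ins for model calls (proposing continuations of a partial trace, and scoring a partial trace, respectively); none of these names come from the study:

```python
def prm_beam_search(expand, prm_score, initial, beam_width=4, steps=3):
    """Keep the top-`beam_width` partial traces at each step, ranked by PRM.

    `expand(trace)` returns candidate continuations of a partial trace;
    `prm_score(trace)` returns a scalar score for a partial trace.
    """
    beam = [initial]
    for _ in range(steps):
        candidates = [c for trace in beam for c in expand(trace)]
        beam = sorted(candidates, key=prm_score, reverse=True)[:beam_width]
    return beam[0]

# Toy problem: traces are lists of bits, score is their sum, so the
# search should find the all-ones trace.
result = prm_beam_search(
    expand=lambda t: [t + [0], t + [1]],
    prm_score=sum,
    initial=[],
    beam_width=2,
    steps=3,
)
print(result)  # -> [1, 1, 1]
```

Unlike Best-of-N, pruning happens before traces are complete, so the budget is not spent finishing traces the PRM already scores poorly.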

### 4. Budget Enforcement
Dynamically adjust the generation length and reasoning depth to keep computational consumption within budget, trading efficiency against quality.
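In its simplest form, budget enforcement is a cap on the decoding loop. A minimal sketch, where `step_fn` stands in for one decoding step of a real model (the names and the toy model are illustrative):

```python
def generate_with_budget(step_fn, max_tokens: int, stop_token: str = "<eos>"):
    """Decode until the model emits `stop_token` or the token budget runs out.

    `step_fn(tokens_so_far)` stands in for one decoding step: it takes
    the tokens generated so far and returns the next token.
    """
    tokens: list[str] = []
    while len(tokens) < max_tokens:
        token = step_fn(tokens)
        if token == stop_token:
            break
        tokens.append(token)
    return tokens

# A toy "model" that would emit 10 tokens; a budget of 4 truncates it.
toy_model = lambda toks: f"t{len(toks)}" if len(toks) < 10 else "<eos>"
print(len(generate_with_budget(toy_model, max_tokens=4)))  # -> 4
```

More sophisticated variants adjust the cap per problem (e.g. a larger budget for questions an uncertainty estimate flags as hard) rather than using one fixed limit.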

## Comparison of SFT and GRPO Fine-Tuning Paradigms

### Supervised Fine-Tuning (SFT)
The mainstream approach: learn task patterns through supervised learning on high-quality annotated data. Advantages: stable training, fast convergence, and direct imitation of expert reasoning traces. Disadvantages: limited generalization to out-of-distribution problems.

### GRPO Fine-Tuning
GRPO (Group Relative Policy Optimization) is a reinforcement learning method that optimizes the policy to maximize reward. Rather than imitating fixed patterns, it explores diverse problem-solving strategies. Challenges: unstable training and reward hacking.
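A distinctive ingredient of GRPO is that it needs no learned value function: for each prompt it samples a group of completions and normalizes each reward against the group's own statistics. A minimal sketch of that advantage computation (a simplification; the full method also involves a clipped policy-gradient objective and a KL penalty):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Compute GRPO-style advantages for one group of sampled completions:
    each completion's advantage is its reward minus the group mean,
    divided by the group standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Four completions for one prompt, two of them rewarded as correct.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # -> [1.0, -1.0, 1.0, -1.0]
```

This group-relative baseline is what pushes the policy toward completions that beat their siblings, which plausibly encourages the answer diversity noted in the findings below.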

## Research Findings and Insights

Core finding: There is a significant interaction effect between fine-tuning methods and inference-time strategies.

- For SFT models: majority voting already yields a considerable accuracy improvement, since their answer patterns are consistent.
- For GRPO models: more complex PRM-guided strategies work better, since their higher answer diversity benefits from fine-grained filtering.

Impact of budget size: simple strategies are the most cost-effective at small budgets; complex search strategies extract more value from the extra compute at large budgets.

## Practical Application Significance

It provides direct guidance for the deployment of large reasoning models:

- Developers should select the inference strategy based on how the model was trained, rather than choosing a strategy in isolation.
- Resource-constrained scenarios: find the strategy that delivers the largest accuracy gain for the smallest computational overhead.
- Performance-critical scenarios: understand the upper limits and boundaries of each strategy in order to design efficient reasoning systems.

## Future Research Directions

- Design adaptive inference-time strategies: dynamically adjust computational allocation based on problem difficulty.
- Build hybrid reasoning frameworks: combine the advantages of multiple strategies.
- Adapt to model capability improvements: evolve inference-time strategies to match new model characteristics.

## Conclusion

Against the backdrop of enhanced reasoning capabilities of large language models, efficient utilization of computational resources is a key issue. This study provides empirical evidence and decision-making references for efficient reasoning systems by systematically comparing the combined effects of inference-time strategies and fine-tuning methods. We look forward to the emergence of more intelligent and efficient reasoning paradigms.
