# LLM Optimization Strategies Under Compute Budget Constraints: A Trade-off Analysis Between Fine-tuning and Inference-time Expansion

> The compute-scaling-frontier project uses systematic experimental design to explore the optimal trade-off between fine-tuning training and inference-time expansion strategies for small language models under a fixed compute budget, providing decision-making support for model deployment in cost-sensitive scenarios.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-03T23:13:54.000Z
- Last activity: 2026-05-03T23:23:56.964Z
- Popularity: 154.8
- Keywords: compute budget optimization, fine-tuning, inference-time scaling, LoRA, self-consistency reasoning, cost analysis, GSM8K, small language models, Pareto frontier, model deployment
- Page link: https://www.zingnex.cn/en/forum/thread/llm-a1eb29da
- Canonical: https://www.zingnex.cn/forum/thread/llm-a1eb29da
- Markdown source: floors_fallback

---

## [Introduction] LLM Optimization Strategies Under Compute Budget Constraints: A Study on the Trade-off Between Fine-tuning and Inference Expansion

This project studies optimization strategies for small language models under a fixed compute budget. Its core question is how to split resources between one-time fine-tuning and inference-time expansion (e.g., self-consistency reasoning). Through experiments on the GSM8K mathematical reasoning benchmark, combining LoRA fine-tuning, synthetic data generation, and composable inference strategies, it aims to provide quantitative decision support for model deployment in cost-sensitive scenarios and to map the cost-accuracy Pareto frontier.

## Background and Core Problem

In LLM deployment, compute is a key constraint. Developers face a decision dilemma: under a limited budget, should resources go into one-time fine-tuning (a fixed cost) or inference-time expansion (a variable cost that grows linearly with query volume)? The trade-off depends on expected query volume: at low volumes, inference-time expansion avoids an upfront investment, while at high volumes the fixed fine-tuning cost is amortized across many queries. This project aims to locate the boundary between these two regimes experimentally.

## Experimental Design and Technical Components

The experiment uses GSM8K as the evaluation benchmark, adopts the Qwen2.5-1.5B-Instruct model, and integrates three core libraries:
1. sdg_hub: Uses GPT-4o-mini to generate synthetic mathematical reasoning data, reducing annotation costs;
2. training_hub: Provides LoRA parameter-efficient fine-tuning capabilities;
3. its_hub: Implements strategies such as greedy decoding and self-consistency reasoning.
The experiment grid covers model variants, training data scale, inference strategies, budget allocation, and cost calculation under multiple query volumes.
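The experiment grid described above can be sketched as a Cartesian product over the study's dimensions. This is an illustrative reconstruction: the dimension names and values below (variant labels, strategy names, volume tiers) are assumptions, not the project's actual configuration.

```python
from itertools import product

# Hypothetical experiment grid; all values here are illustrative assumptions.
model_variants = ["base", "lora-500", "lora-2000"]        # base model vs. LoRA fine-tunes on N synthetic samples
inference_strategies = ["greedy", "self-consistency-k8"]  # decoding strategies (its_hub-style)
query_volumes = [100, 1_000, 10_000, 100_000]             # expected deployment query counts

# Each cell of the grid is one (model, strategy, volume) experiment to cost and score.
grid = list(product(model_variants, inference_strategies, query_volumes))
print(f"{len(grid)} experiment cells")
for model, strategy, volume in grid[:3]:
    print(model, strategy, volume)
```

Enumerating the grid up front makes it easy to smoke-test a single cell end-to-end before launching the full sweep, which matches the vertical-slice approach described later in this post.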

## Key Technical Findings and Optimizations

Two issues were found during implementation:
1. Self-consistency reasoning, by default, majority-votes over entire response texts, which is unsuitable for GSM8K, where only the final numeric answer matters. This was resolved by projecting each response into answer space with the `final_answer_projection` function;
2. `max_tokens=256` truncated some responses. The limit was raised to 512, and format-diagnostic metrics (e.g., `has_final_marker_rate`) were added to monitor generation quality.

## Cost Modeling and Economic Analysis

A simplified cost model was established:
- Synthetic data cost: a function of sample count and the teacher model used;
- Training cost: a function of sample count and GPU training hours (LoRA cuts this substantially);
- Inference cost: determined by generated token count, number of samples drawn, etc. (self-consistency costs a multiple of greedy decoding);
- Total cost: training cost + query volume × per-query inference cost. This formula makes the break-even point explicit and guides strategy selection.
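The total-cost formula above can be turned into a small break-even calculator. A minimal sketch: the two strategy cost figures below are invented placeholder values, not measurements from the project.

```python
def total_cost(training_cost: float, per_query_cost: float, query_volume: int) -> float:
    """Total cost = one-time training cost + query_volume * per-query inference cost."""
    return training_cost + query_volume * per_query_cost

def break_even_volume(training_cost: float, cheap_per_query: float, expensive_per_query: float) -> float:
    """Query volume at which a fine-tuned model (fixed cost, cheap greedy queries)
    overtakes inference-time expansion (no training, pricier multi-sample queries)."""
    return training_cost / (expensive_per_query - cheap_per_query)

# Assumed illustrative figures (USD): LoRA fine-tune once then decode greedily,
# vs. no training plus 8-sample self-consistency on every query.
finetune = {"training_cost": 50.0, "per_query_cost": 0.0005}
sc_only = {"training_cost": 0.0, "per_query_cost": 0.0040}

v = break_even_volume(finetune["training_cost"],
                      finetune["per_query_cost"],
                      sc_only["per_query_cost"])
print(f"break-even at ~{v:,.0f} queries")  # ~14,286 under these assumptions
```

Below the break-even volume, pure inference-time expansion is cheaper; above it, the amortized fine-tune wins, which is exactly the regime boundary the project's Pareto analysis is meant to map.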

## Current Progress and Future Plans

Currently, a local vertical slice and smoke test (verifying that the components connect end to end) have been completed; full LoRA training runs and the Pareto charts are still in progress. Next steps are to finish training, generate the Pareto frontier, extend to additional models and task domains, and explore further inference strategies such as Best-of-N.

## Practical Insights and Recommendations

Recommendations for developers:
1. Clarify the expected query volume (the key input to strategy selection);
2. Build a full-lifecycle cost model covering training, inference, and operations;
3. Balance accuracy against inference cost;
4. Keep the strategy flexible (adjust dynamically as query volume changes).
The project's open-source framework gives the community a reusable experimental foundation for exploring these strategies in other scenarios.
