LLM Optimization Strategies Under Compute Budget Constraints: A Trade-off Analysis Between Fine-tuning and Inference-time Expansion

The compute-scaling-frontier project uses systematic experimental design to explore the optimal trade-off between fine-tuning training and inference-time expansion strategies for small language models under a fixed compute budget, providing decision-making support for model deployment in cost-sensitive scenarios.

Tags: Compute budget optimization · Fine-tuning training · Inference-time expansion · LoRA · Self-consistency reasoning · Cost analysis · GSM8K · Small language models · Pareto frontier · Model deployment
Published 2026-05-04 07:13 · Recent activity 2026-05-04 07:23 · Estimated read: 6 min

Section 01

Introduction

This project studies optimization strategies for small language models under a fixed compute budget. The core question is whether resources are better invested in one-time fine-tuning or in inference-time expansion (e.g., self-consistency reasoning). Through experiments on the GSM8K mathematical reasoning benchmark, combining LoRA fine-tuning, synthetic data generation, and integrated inference strategies, the project aims to map the cost-accuracy Pareto frontier and provide quantitative decision support for model deployment in cost-sensitive scenarios.


Section 02

Background and Core Problem

In LLM deployment, compute resources are a key constraint. Developers face a decision dilemma: under a limited budget, should resources go into one-time fine-tuning (a fixed cost) or inference-time expansion (a variable cost that grows linearly with query volume)? The trade-off hinges on the expected query volume: at low volumes, inference expansion may be cheaper, while at high volumes the fixed fine-tuning cost is amortized across queries. This project aims to locate the boundary between these regimes experimentally.
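
To make the amortization argument concrete, the break-even condition can be written out explicitly (the symbols below are illustrative notation, not taken from the project): let $C_{\mathrm{ft}}$ be the one-time fine-tuning cost, $c_{\mathrm{ft}}$ and $c_{\mathrm{exp}}$ the per-query inference costs of the fine-tuned model and of the inference-expansion baseline, and $Q$ the expected query volume. Fine-tuning becomes the cheaper strategy once

$$
C_{\mathrm{ft}} + Q\,c_{\mathrm{ft}} < Q\,c_{\mathrm{exp}}
\quad\Longleftrightarrow\quad
Q > \frac{C_{\mathrm{ft}}}{c_{\mathrm{exp}} - c_{\mathrm{ft}}},
$$

assuming inference expansion is indeed the more expensive option per query ($c_{\mathrm{exp}} > c_{\mathrm{ft}}$).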


Section 03

Experimental Design and Technical Components

The experiment uses GSM8K as the evaluation benchmark, adopts the Qwen2.5-1.5B-Instruct model, and integrates three core libraries:

  1. sdg_hub: uses GPT-4o-mini to generate synthetic mathematical reasoning data, reducing annotation costs;
  2. training_hub: provides LoRA parameter-efficient fine-tuning;
  3. its_hub: implements inference strategies such as greedy decoding and self-consistency reasoning.

The experiment grid spans model variants, training data scale, inference strategies, budget allocations, and cost calculations under multiple query volumes (a sketch follows).
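
A minimal sketch of how such a grid might be enumerated in Python; the dimension names and values are illustrative assumptions, not the project's actual configuration.

```python
from itertools import product

# Illustrative grid dimensions (assumed values, not the project's exact settings)
MODELS = ["base", "lora-finetuned"]           # model variants
TRAIN_SIZES = [0, 500, 2000]                  # synthetic training samples consumed
STRATEGIES = ["greedy", "self-consistency"]   # inference strategies
SC_SAMPLES = [1, 5, 10]                       # samples drawn per query for self-consistency
QUERY_VOLUMES = [1e2, 1e4, 1e6]               # applied post-hoc in the cost analysis

def experiment_grid():
    """Enumerate every coherent (model, data, strategy) configuration."""
    for model, n_train, strategy, k in product(MODELS, TRAIN_SIZES, STRATEGIES, SC_SAMPLES):
        if strategy == "greedy" and k != 1:
            continue  # greedy decoding draws a single sample
        if model == "base" and n_train > 0:
            continue  # the un-tuned baseline consumes no training data
        yield {"model": model, "n_train": n_train, "strategy": strategy, "k": k}

for cfg in experiment_grid():
    print(cfg)
```

Each configuration's accuracy is then measured once on GSM8K, and its total cost is evaluated against each query volume in QUERY_VOLUMES.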

Section 04

Key Technical Findings and Optimizations

Two issues surfaced during implementation:

  1. By default, self-consistency votes over the entire response text, which does not suit GSM8K, where only the final answer matters. This was resolved by projecting responses into numerical answer space via the final_answer_projection function;
  2. max_tokens=256 truncated some responses. It was raised to 512, and format diagnostic metrics (e.g., has_final_marker_rate) were added to monitor generation quality; a standalone sketch of both fixes follows.
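
In this sketch, the regex, the "####" marker convention, and the function bodies are illustrative assumptions; its_hub's actual final_answer_projection API may differ.

```python
import re
from collections import Counter

FINAL_MARKER = "####"  # GSM8K-style final-answer marker (assumed output format)

def extract_final_answer(response: str) -> str | None:
    """Project a full chain-of-thought response onto its final numeric answer."""
    match = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", response)
    return match.group(1).replace(",", "") if match else None

def self_consistency_vote(responses: list[str]) -> str | None:
    """Majority-vote in answer space rather than over raw response text."""
    answers = [a for a in map(extract_final_answer, responses) if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None

def has_final_marker_rate(responses: list[str]) -> float:
    """Diagnostic: fraction of responses containing the final-answer marker.
    A low rate suggests truncation, e.g., max_tokens set too small."""
    return sum(FINAL_MARKER in r for r in responses) / len(responses) if responses else 0.0
```

Voting in answer space means two differently worded derivations of the same number count as one vote, which is exactly what GSM8K scoring requires.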

Section 05

Cost Modeling and Economic Analysis

A simplified cost model was established:

  • Synthetic data cost: scales with the number of samples and the teacher model's pricing;
  • Training cost: driven by sample count and GPU hours (LoRA substantially reduces this);
  • Inference cost: determined by the model, tokens generated, and the number of samples drawn (self-consistency costs more than greedy decoding);
  • Total cost: training cost + query volume × per-query inference cost.

This model pins down the break-even point between strategies and guides selection; a minimal implementation sketch follows.
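
Here is such a sketch in Python, with hypothetical cost numbers purely for illustration:

```python
from dataclasses import dataclass

@dataclass
class Strategy:
    """Cost parameters for one deployment strategy (values below are hypothetical)."""
    name: str
    fixed_cost: float      # one-time cost: synthetic data + LoRA training (0 if no fine-tuning)
    per_query_cost: float  # inference cost per query (higher for self-consistency)

def total_cost(s: Strategy, queries: float) -> float:
    # Total cost formula from the text: training cost + query volume x per-query inference cost
    return s.fixed_cost + queries * s.per_query_cost

def break_even(a: Strategy, b: Strategy) -> float | None:
    """Query volume at which the two strategies cost the same, if it exists."""
    if a.per_query_cost == b.per_query_cost:
        return None
    q = (b.fixed_cost - a.fixed_cost) / (a.per_query_cost - b.per_query_cost)
    return q if q > 0 else None

# Hypothetical numbers for illustration only
finetuned = Strategy("LoRA fine-tune + greedy", fixed_cost=50.0, per_query_cost=0.0005)
expansion = Strategy("base + self-consistency (k=10)", fixed_cost=0.0, per_query_cost=0.005)

q_star = break_even(finetuned, expansion)
print(f"Fine-tuning amortizes after ~{q_star:,.0f} queries")  # ~11,111 with these numbers
```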

Section 06

Current Progress and Future Plans

A local vertical slice and smoke test (verifying that the components connect end to end) have been completed. Full LoRA training runs and the Pareto charts are still in progress. Future work will finish training, generate the Pareto frontier, extend to additional models and task domains, and explore further inference strategies such as Best-of-N.


Section 07

Practical Insights and Recommendations

Recommendations for developers:

  1. Clarify the expected query volume (the key input for strategy selection);
  2. Establish a full-lifecycle cost model covering training, inference, and operations;
  3. Balance accuracy against inference cost;
  4. Maintain strategy flexibility and adjust dynamically as query volume changes; a small decision helper is sketched below.

The project's open-source framework gives the community a reusable experimental foundation, supporting strategy exploration across different scenarios.
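
Putting recommendations 1 and 4 together, a small self-contained decision helper (all cost numbers hypothetical) might look like this:

```python
def pick_strategy(strategies: dict[str, tuple[float, float]], expected_queries: float) -> str:
    """Choose the cheapest strategy for the expected query volume; re-run this
    as observed traffic drifts. Each strategy maps to (fixed_cost, per_query_cost)."""
    return min(strategies,
               key=lambda name: strategies[name][0] + expected_queries * strategies[name][1])

# Hypothetical cost parameters for illustration only
strategies = {
    "LoRA fine-tune + greedy": (50.0, 0.0005),
    "base + self-consistency": (0.0, 0.005),
}
print(pick_strategy(strategies, 1_000))    # low volume favors inference expansion
print(pick_strategy(strategies, 100_000))  # high volume amortizes fine-tuning
```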