Small_Scale: Pruning Long Chain-of-Thought in Large Reasoning Models via Small-Scale Preference Optimization

The Small_Scale project provides the official implementation of the ICLR 2026 paper, including a complete LLM offline inference evaluation toolkit and DPO training framework, supporting vLLM/SGLang backends, multi-type benchmark tests, and preference optimization training based on LLaMA-Factory.

Tags: LLM, reasoning, chain-of-thought, pruning, preference optimization, DPO, vLLM, SGLang, evaluation, ICLR
Published 2026-03-31 14:05 · Recent activity 2026-03-31 14:26 · Estimated read: 8 min

Section 01

Introduction to the Small_Scale Project

Small_Scale is the official open-source implementation of the ICLR 2026 paper Pruning Long Chain-of-Thought in Large Reasoning Models via Small-Scale Preference Optimization. It prunes long chain-of-thought in large reasoning models through small-scale preference optimization, tackling the heavy computational overhead that long reasoning traces incur. The project provides a complete LLM offline inference evaluation toolkit and a DPO training framework, supporting vLLM/SGLang backends, multiple benchmark types, and preference-optimization training based on LLaMA-Factory, thereby offering infrastructure for reasoning-model research and development.


Section 02

Research Background and Challenges

Large reasoning models solve complex problems via long chain-of-thought, but excessive reasoning incurs heavy computational overhead and latency, limiting deployment efficiency in practice. Traditional remedies require extensive fine-tuning data or full retraining, both resource-intensive. The core insight of Small_Scale is that small-scale preference optimization can effectively prune redundant chain-of-thought content without sacrificing reasoning quality.
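To make the idea concrete, preference optimization here means training on pairs where a concise-but-correct chain-of-thought is preferred over a verbose one. The sketch below is not the paper's code; it only evaluates the standard DPO loss on one such pair, with hypothetical log-probabilities:

```python
# Illustrative sketch (not the paper's code): the DPO objective on one
# preference pair, where the concise-but-correct chain-of-thought is
# "chosen" and the verbose one "rejected". All log-probs are hypothetical.
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# The policy already favors the short CoT more than the reference does,
# so the loss falls below log(2) ~ 0.6931, the value at zero margin.
loss = dpo_loss(pi_chosen=-12.0, pi_rejected=-30.0,
                ref_chosen=-15.0, ref_rejected=-25.0)
print(round(loss, 4))  # → 0.3711
```

Minimizing this loss pushes the policy toward the shorter chosen trace relative to the rejected one, which is the mechanism by which preference optimization can shorten reasoning chains.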


Section 03

Project Overview and Toolkit Architecture

Beyond the paper's reference implementation, Small_Scale ships a fully functional LLM evaluation and training toolkit that covers the complete workflow. The toolkit adopts a modular architecture:

  • Configuration layer (config/): Manages global paths, dataset metadata, and other configurations;
  • Data layer (data/test/): Ships three categories of widely used benchmark datasets in parquet format: mathematics, code, and multiple-choice questions;
  • Inference layer (eval/generation/): Supports vLLM (multi-process/random shuffle/single-process) and SGLang backends;
  • Evaluation layer (eval/judgers/): Implements dedicated judges for mathematics, code, and multiple-choice questions, as well as the LLM-as-Judge mode;
  • Training layer (LLaMA-Factory/): Integrates LLaMA-Factory to support DPO training and DeepSpeed ZeRO-3 configuration.
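As a hypothetical illustration of the configuration layer, config/path.yaml might hold entries along these lines; the key names here are assumptions for illustration only, not the project's actual schema:

```yaml
# Illustrative only — key names are assumptions, not the real schema.
model_root: /data/models     # where model weights are placed
data_root: data/test         # parquet benchmark datasets
output_root: outputs         # inference results and judge logs
```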

Section 04

Detailed Explanation of Core Features

  1. Flexible Inference Backends: Supports vLLM (multi-process data parallelism/random shuffle/single process) and SGLang, adapting to different scenarios;
  2. Comprehensive Benchmark Tests: Covers mathematics (AIME/GSM8K, etc.), code (LiveCodeBench), and multiple-choice (MMLU, etc.) tasks, using corresponding evaluation metrics;
  3. Automated Evaluation: The autojudger module automatically identifies tasks, calls judges, calculates scores, and records logs;
  4. End-to-End Pipeline: The output path of the inference script is written to a temporary file, enabling seamless integration between inference and evaluation.
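The hand-off in point 4 can be sketched in a few lines. This is not the project's code, only a minimal stdlib illustration of the pattern described: the inference step records where its results live in a temporary marker file, and the evaluation step reads that marker to find them:

```python
# Minimal sketch (not the project's code) of the inference→evaluation
# hand-off: inference writes its output path to a temp file, and the
# evaluator reads the path back. File names here are hypothetical.
import json
import tempfile
from pathlib import Path

def run_inference(marker: Path) -> None:
    # Hypothetical: pretend we generated completions and saved them.
    out_file = Path(tempfile.gettempdir()) / "generations.jsonl"
    out_file.write_text(json.dumps({"question": "1+1", "answer": "2"}) + "\n")
    marker.write_text(str(out_file))  # record where the results live

def run_evaluation(marker: Path) -> int:
    out_file = Path(marker.read_text())  # discover the results file
    records = [json.loads(line) for line in out_file.read_text().splitlines()]
    return sum(r["answer"] == "2" for r in records)  # toy scoring

marker = Path(tempfile.gettempdir()) / "latest_output_path.txt"
run_inference(marker)
print(run_evaluation(marker))  # → 1
```

The benefit of this design is loose coupling: the evaluation script needs no knowledge of how the inference script names or organizes its outputs.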

Section 05

Usage Instructions

  • Environment Preparation: Set the paths in config/path.yaml and place the model weights accordingly; requires Python 3.10+ and the related libraries;
  • Inference Evaluation: Take vLLM multi-process as an example: python eval/generation/vllm_offline.py --config ... --model_name ... --dataset_name ...;
  • Automated Evaluation: python eval/judgers/autojudger.py --config ... --file_path ...;
  • DPO Training: After configuring dpo.yaml, start with: export CUDA_VISIBLE_DEVICES=...; llamafactory-cli train ....
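As a rough illustration of the DPO step, a LLaMA-Factory dpo.yaml could look like the following; every value is a placeholder assumption, and the authoritative key names are those in the LLaMA-Factory documentation and the project's own dpo.yaml:

```yaml
# Illustrative placeholder values — consult the project's actual dpo.yaml.
stage: dpo
do_train: true
model_name_or_path: /path/to/reasoning-model   # hypothetical path
dataset: my_preference_pairs                   # hypothetical dataset name
template: qwen                                 # assumption: depends on base model
finetuning_type: full
pref_beta: 0.1
output_dir: saves/dpo-pruned
deepspeed: examples/deepspeed/ds_z3_config.json  # ZeRO-3, per the toolkit
learning_rate: 5.0e-6
num_train_epochs: 1.0
```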

Section 06

Technical Highlights and Application Scenarios

Technical Highlights:

  1. Data Parallelism Optimization: vLLM multi-process sharding improves throughput, with optional random shuffling to reduce ordering bias;
  2. Flexible Sampling Configuration: Unified parameter structure, adjustable temperature/top_p, etc., supporting advanced configurations like tensor parallelism;
  3. LLM-as-Judge: Supports calling OpenAI API and others for intelligent evaluation of complex outputs.
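The LLM-as-Judge mode in point 3 typically amounts to building a grading prompt for an external model and parsing a verdict out of its reply. The sketch below is an illustrative assumption, not the project's actual judge; the prompt wording and verdict format are invented for the example:

```python
# Hypothetical sketch of the LLM-as-Judge pattern: build a grading prompt
# for an external judge model and parse the verdict from its reply.
# Prompt wording and verdict format are illustrative assumptions.
import re

def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    return (
        "You are a strict grader. Reply with VERDICT: CORRECT or "
        "VERDICT: INCORRECT.\n"
        f"Question: {question}\nReference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
    )

def parse_verdict(reply: str) -> bool:
    # Accept the verdict anywhere in the reply, case-insensitively.
    m = re.search(r"VERDICT:\s*(CORRECT|INCORRECT)", reply, re.IGNORECASE)
    return bool(m) and m.group(1).upper() == "CORRECT"

# In practice the prompt would be sent to a judge model (e.g. via an API);
# here we only exercise the parsing on canned replies.
print(parse_verdict("Reasoning... VERDICT: CORRECT"))   # → True
print(parse_verdict("VERDICT: INCORRECT, wrong sign"))  # → False
```

Keeping the verdict machine-parsable is what lets such a judge slot into an automated scoring pipeline alongside the rule-based judges.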

Application Scenarios:

  1. Research on Pruning of Reasoning Models: Provides experimental infrastructure;
  2. Model Selection Comparison: Obtains comparable metrics through standardized benchmark tests;
  3. Continuous Integration Monitoring: Easy to integrate into CI/CD pipelines, supporting version regression testing.

Section 07

Academic Contributions and Summary

Academic Contributions: The paper corresponding to the project was accepted by ICLR 2026, proposing a method to prune long chain-of-thought via small-scale preference optimization, balancing reasoning ability and efficiency.

Summary: Small_Scale is not only an implementation of the paper but also a fully functional LLM evaluation and training infrastructure. Its modular architecture and multi-backend support lower the barrier to entry for research and help advance reasoning-model technology.