Zing Forum

STACK: An Efficient Reasoning Compression Framework for Large Reasoning Models to "Think Less, Do More"

This article introduces the STACK framework, which reduces reasoning length by 59.9% while maintaining or even improving accuracy through state-aware reasoning compression and knowledge guidance. This method dynamically identifies redundant reasoning steps and combines PPO and DPO training strategies, opening up a new path for efficiency optimization of large reasoning models.

Tags: large reasoning models, chain-of-thought compression, efficient reasoning, PPO, DPO, retrieval augmentation, overthinking, machine learning
Published 2026-04-10 17:31 · Recent activity 2026-04-13 09:53 · Estimated read 8 min

Section 01

Introduction to the STACK Framework: A New Path for Efficient Reasoning of Large Reasoning Models

Large reasoning models (e.g., OpenAI o1, DeepSeek-R1) rely on lengthy thought chains to achieve breakthroughs in complex tasks, but overthinking leads to high computational costs, reasoning delays, and decreased accuracy. The STACK framework, through state-aware reasoning compression and knowledge guidance, reduces reasoning length by 59.9% while increasing accuracy by 4.8 percentage points across three mathematical reasoning benchmarks, opening a new path for efficiency optimization of large models.


Section 02

Background: Overthinking Problems of Large Reasoning Models and Limitations of Existing Compression Methods

Overthinking Phenomena

  1. Redundant Verification Loops: After reaching an initial conclusion, repeatedly verifying the same step, generating a large number of tokens with no new information;
  2. Self-Correction Quagmire: Falling into a cycle of doubt and correction, which may eventually lead to wrong answers;
  3. Irrelevant Knowledge Proliferation: Invoking background knowledge unrelated to the problem, wasting compute and introducing distractors.

Limitations of Existing Compression Methods

  • Coarse-grained Compression: Lacks fine-grained analysis, easily deletes key steps or retains redundancy;
  • Static Strategies: Fixed rules cannot adapt to dynamic reasoning stages;
  • Trade-off Dilemma: Aggressive compression sacrifices accuracy, while conservative compression fails to solve the root problem.

Section 03

Core Design of the STACK Framework: State Awareness and Dynamic Compression Mechanism

STACK solves the problem through three innovations:

  1. State Awareness: Dynamically identifies two redundant states—uncertain/biased state (requires external knowledge guidance) and overconfident long reasoning state (can be self-compressed);
  2. Dual Compression Mechanism:
    • Knowledge-guided Compression: Retrieves external knowledge bases to correct biases, provide compression references, and enhance confidence;
    • Self-prompt Compression: Guides the model to identify repeated steps and generate concise equivalent reasoning;
  3. Early Stopping on Answer Convergence: Terminates reasoning when the answer remains the same for N consecutive steps and confidence is stable, suppressing redundant verification.
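The third mechanism, early stopping on answer convergence, can be illustrated with a minimal sketch. The paper does not publish code, so the function name `should_stop` and the parameters `n` and `conf_tol` are assumptions; the rule itself follows the description above: stop once the intermediate answer has been identical for N consecutive steps and confidence has stabilized.

```python
def should_stop(answers, confidences, n=3, conf_tol=0.02):
    """Hypothetical sketch of STACK's early-stopping rule.

    answers     -- intermediate answers, one per reasoning step
    confidences -- the model's confidence at each step (0..1)
    Stop when the last n answers are identical AND the confidence
    spread over those steps is below conf_tol (i.e. stable).
    """
    if len(answers) < n:
        return False
    recent_answers = answers[-n:]
    recent_conf = confidences[-n:]
    same_answer = all(a == recent_answers[0] for a in recent_answers)
    stable_conf = (max(recent_conf) - min(recent_conf)) < conf_tol
    return same_answer and stable_conf
```

Checking both conditions matters: a repeated answer with swinging confidence still signals the self-correction quagmire described in Section 02, so stopping there would be premature.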

Section 04

Training Strategy: Hybrid Training with PPO and DPO Collaboration

Online Comparative Sample Construction

For each problem, generate a long version (an unconstrained chain of thought) and a short version (compressed reasoning) to serve as a preference pair.
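A sketch of how such pairs might be filtered, assuming the standard DPO convention of (chosen, rejected) tuples. The `Trace` dataclass and the rule of keeping a pair only when both traces reach the gold answer are assumptions for illustration, not details from the paper:

```python
from dataclasses import dataclass

@dataclass
class Trace:
    text: str    # the full reasoning text
    answer: str  # the final answer extracted from it

def build_preference_pairs(samples, gold_answers):
    """Hypothetical sketch of online comparative sample construction.

    samples      -- list of (long_trace, short_trace) pairs sampled
                    online by the policy for each problem
    gold_answers -- reference answers, one per problem
    Keeps a (chosen, rejected) pair only when both traces are correct,
    preferring the shorter (compressed) trace as 'chosen'.
    """
    pairs = []
    for (long_t, short_t), gold in zip(samples, gold_answers):
        if long_t.answer == gold and short_t.answer == gold:
            pairs.append((short_t.text, long_t.text))  # (chosen, rejected)
    return pairs
```

Restricting pairs to cases where both traces are correct keeps the preference signal about length alone, rather than conflating brevity with correctness.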

Hybrid Training Objectives

  • PPO Component: Optimizes the policy network to stably select compression actions;
  • DPO Component: Uses preference signals to train concise reasoning generation;
  • Reward Function: Includes accuracy rewards (positive for correct answers/negative for wrong answers) and efficiency rewards (higher for shorter lengths, with an over-compression threshold set).
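The reward described above can be sketched as a simple scalar. The weights, the `min_length_ratio` over-compression threshold, and the linear efficiency term are all assumptions chosen to match the stated behavior (positive for correct, negative for wrong, larger bonus for shorter traces, penalty below the threshold):

```python
def stack_reward(correct, length, ref_length,
                 min_length_ratio=0.2, acc_weight=1.0, eff_weight=0.5):
    """Hypothetical sketch of STACK's reward function.

    correct    -- whether the final answer is right
    length     -- token length of the compressed trace
    ref_length -- token length of the uncompressed reference trace
    """
    accuracy = 1.0 if correct else -1.0
    ratio = length / max(ref_length, 1)
    if ratio < min_length_ratio:
        efficiency = -1.0          # over-compressed: withdraw the bonus
    else:
        efficiency = 1.0 - ratio   # shorter trace => larger bonus
    return acc_weight * accuracy + eff_weight * efficiency
```

Without the over-compression guard, the efficiency term alone would push the policy toward degenerate one-line "reasoning", which is exactly the trade-off dilemma Section 02 warns about.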

Section 05

Experimental Validation: Win-Win of Efficiency and Accuracy

Benchmark Settings

Tested on three benchmarks: GSM8K (elementary school math), MATH (high school competition), and OlympiadBench (Olympiad-level difficult problems).

Core Results

  • Reasoning length reduced by 59.9%, with over 70% compression for some simple problems;
  • Accuracy increased by 4.8 percentage points, proving that overthinking impairs performance;
  • Cross-model Consistency: Effective across model families such as Llama, Qwen, and GPT-4.

Ablation Experiments

  • Removing state awareness leads to a significant performance drop;
  • Knowledge guidance + self-prompt achieves the best effect;
  • Early stopping mechanism saves computation and improves accuracy simultaneously;
  • Hybrid training is better than pure PPO or pure DPO.

Section 06

Application Prospects: Implications for Deployment and Research

Deployment Significance

  • Cost Reduction: Cutting reasoning length by roughly 60% directly lowers computational costs;
  • Experience Improvement: Lower latency improves real-time interaction scenarios;
  • Environmental Protection: Reduces energy consumption and carbon emissions.

Research Implications

  • Efficiency and capability can be achieved simultaneously; intelligence needs to "know when to stop";
  • Metacognitive ability (self-state awareness) is an improvement direction;
  • RAG technology can be used to optimize the reasoning process.

Section 07

Limitations and Future Work

Limitations

  • Domain Generalization: Only verified on mathematical reasoning; needs to be extended to creative writing, dialogue, etc.;
  • Knowledge Base Dependence: The effect of knowledge guidance is affected by the quality of external knowledge bases;
  • Compression Limit: Accuracy drops once compression exceeds a threshold; the optimal compression ratio remains to be determined;
  • Interpretability: The logic of compression decisions is not transparent enough.

Future Directions

Explore cross-domain adaptation, optimize knowledge base dependence, study compression limits, and improve interpretability.