Zing Forum

ACTS: Efficient and Controllable LLM Reasoning via Agentic Chain-of-Thought Steering

ACTS models reasoning guidance as a Markov Decision Process, where a controller agent dynamically selects strategies during reasoning. It achieves significant token savings and a controllable accuracy-efficiency trade-off while maintaining reasoning quality.

Tags: chain-of-thought reasoning, agents, reinforcement learning, reasoning control, efficiency optimization
Published 2026-06-03 01:51 | Recent activity 2026-06-03 12:24 | Estimated read 7 min

Section 01

ACTS Overview: Efficient and Controllable Agent-Guided LLM Reasoning

ACTS (Agentic Chain-of-Thought Steering) is a framework for making LLM chain-of-thought reasoning efficient and controllable. It models reasoning guidance as a Markov Decision Process and uses a dual-agent architecture, a frozen reasoner plus a controller agent, to select reasoning strategies dynamically. This yields significant token savings while maintaining reasoning quality, and it supports a flexible accuracy-efficiency trade-off. The work offers a new path toward fine-grained control of LLM reasoning.

Section 02

Background: Problems of Chain-of-Thought Reasoning and Limitations of Existing Methods

The Double-Edged Sword of Chain-of-Thought Reasoning

Chain-of-thought (CoT) reasoning improves the accuracy of large language models, but it has two major drawbacks:

  1. Inefficient token consumption: The model generates a lot of redundant content, which wastes computing resources;
  2. Lack of reasoning control: Users cannot steer the direction or depth of the model's thinking.

Limitations of Existing Methods

Existing efficient-reasoning methods (shortening, early stopping, compression) focus only on "how much to say" and do not address "how to think". The reasoning strategy remains a black box with no explicit guidance or control.

Section 03

Core Methods of ACTS: Dual-Agent Architecture and Training Process

Dual-Agent Architecture

  • Frozen Reasoner: Generates the actual reasoning text; its weights stay frozen so that its base capabilities are preserved;
  • Controller Agent: A lightweight policy network that chooses a guidance action (reasoning strategy + guidance phrase) at each step.
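
The division of labor above can be sketched as a simple steering loop. This is only an illustration under stated assumptions, not the authors' implementation: the names `steer`, `toy_controller`, `toy_reasoner`, and `GUIDANCE`, the "conclude" strategy, the guidance phrases, and the word-count token accounting are all hypothetical. The controller picks a strategy, the matching guidance phrase is prepended to the reasoner's next segment, and the remaining budget shrinks by the tokens the reasoner produced.

```python
from typing import Callable, List, Tuple

# Hypothetical strategies and guidance phrases. The source only names
# "detailed analysis" and "quick verification" as example strategies;
# "conclude" is an assumption added so the loop can terminate.
GUIDANCE = {
    "detailed_analysis": "Let me work through this carefully.",
    "quick_verification": "Let me quickly check the result.",
    "conclude": "So the final answer is",
}


def toy_controller(trajectory: List[str], remaining_budget: int) -> str:
    """Stand-in for the learned policy: a hand-written budget rule."""
    if remaining_budget <= 10:
        return "conclude"
    if not trajectory:
        return "detailed_analysis"
    return "quick_verification"


def toy_reasoner(context: str, guidance_phrase: str, max_tokens: int) -> str:
    """Stand-in for the frozen LLM: returns up to 8 dummy tokens."""
    return " ".join(["tok"] * min(max_tokens, 8))


def steer(
    question: str,
    budget: int,
    controller: Callable[[List[str], int], str] = toy_controller,
    reasoner: Callable[[str, str, int], str] = toy_reasoner,
    max_steps: int = 16,
) -> Tuple[List[str], int]:
    """Alternate controller decisions and reasoner segments until done."""
    trajectory: List[str] = []
    remaining = budget
    for _ in range(max_steps):
        strategy = controller(trajectory, remaining)
        phrase = GUIDANCE[strategy]
        context = "\n".join([question] + trajectory)
        continuation = reasoner(context, phrase, remaining)
        trajectory.append(f"{phrase} {continuation}")
        # Only the reasoner's own output is charged against the budget here.
        remaining -= len(continuation.split())
        if strategy == "conclude" or remaining <= 0:
            break
    return trajectory, budget - remaining


if __name__ == "__main__":
    segments, used = steer("What is 17 * 6?", budget=40)
    for segment in segments:
        print(segment)
    print("tokens used:", used)
```

In a real system the two stand-ins would be replaced by a trained policy network and a call to the frozen model; the loop itself never updates the reasoner.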

MDP Modeling

Step-by-step reasoning is modeled as a Markov Decision Process:

  • State: Summary of current reasoning trajectory + remaining thinking budget;
  • Action: Reasoning strategy (e.g., detailed analysis/quick verification) + guidance phrase;
  • Reward: A signal that integrates budget conditions and reasoning quality.
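
The state and action described above map naturally onto small data types. The sketch below is one possible encoding, not the paper's: the class and field names, the `transition` helper, and the choice to "summarize" by keeping the last 200 characters are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    # The two example strategies named in the text; a real action
    # space would likely contain more.
    DETAILED_ANALYSIS = "detailed analysis"
    QUICK_VERIFICATION = "quick verification"


@dataclass(frozen=True)
class State:
    trajectory_summary: str  # summary of the reasoning so far
    remaining_budget: int    # thinking tokens still available


@dataclass(frozen=True)
class Action:
    strategy: Strategy
    guidance_phrase: str


def transition(state: State, action: Action, new_text: str, tokens_used: int) -> State:
    """Fold one reasoning segment into the state and charge the budget."""
    entry = f"[{action.strategy.value}] {new_text}"
    if state.trajectory_summary:
        summary = f"{state.trajectory_summary} | {entry}"
    else:
        summary = entry
    # Crude stand-in for summarization: keep only the most recent 200 characters.
    return State(summary[-200:], max(0, state.remaining_budget - tokens_used))


if __name__ == "__main__":
    s0 = State(trajectory_summary="", remaining_budget=100)
    a0 = Action(Strategy.QUICK_VERIFICATION, "Let me double-check.")
    print(transition(s0, a0, "2+2=4", tokens_used=12))
```

Keeping the remaining budget inside the state is what lets a single controller behave differently under tight and generous budgets.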

Training Methods

  1. Synthetic Trajectory Initialization: The controller is first trained with supervised learning on synthetic examples augmented across multiple budgets, which gives it basic guidance ability;
  2. Reinforcement Learning Optimization: The controller is then optimized with budget-conditioned reward shaping that accounts for quality, efficiency, and strategy consistency.
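
To make the idea of budget-conditioned reward shaping concrete, here is a minimal sketch of a reward that combines the three signals mentioned above. The functional form, the weights, and the name `shaped_reward` are assumptions chosen for illustration; the source does not give the actual formula.

```python
def shaped_reward(
    correct: bool,
    tokens_used: int,
    budget: int,
    strategy_consistent: bool,
    w_eff: float = 0.3,   # hypothetical weight on the efficiency bonus
    w_cons: float = 0.1,  # hypothetical weight on strategy consistency
    w_over: float = 1.0,  # hypothetical penalty per unit of budget overrun
) -> float:
    """Toy reward mixing quality, efficiency, and strategy consistency."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    quality = 1.0 if correct else 0.0
    spent = tokens_used / budget
    efficiency = max(0.0, 1.0 - spent)  # fraction of the budget left unspent
    overrun = max(0.0, spent - 1.0)     # fraction by which the budget was exceeded
    consistency = 1.0 if strategy_consistent else 0.0
    # The efficiency bonus is gated by quality so that short but wrong
    # answers earn nothing for being short.
    return quality * (1.0 + w_eff * efficiency) + w_cons * consistency - w_over * overrun


if __name__ == "__main__":
    print(shaped_reward(True, 50, 100, True))
    print(shaped_reward(True, 150, 100, True))
```

Because the budget appears in the reward, the same answer length can be rewarded under a generous budget and penalized under a tight one, which is the sense in which the shaping is conditioned on the budget.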

Section 04

Experimental Results: Balance Between Quality and Efficiency, and Generalization Ability

Key Experimental Conclusions

  1. Maintained Reasoning Quality: Performance stays comparable to full reasoning even as token consumption drops significantly;
  2. Significant Token Savings: Compared with unguided reasoning, ACTS achieves substantial token savings, which lowers cost and improves response speed;
  3. Controllable Trade-off: Budget parameters can be adjusted to balance accuracy and efficiency (e.g., allocating more budget to high-accuracy scenarios);
  4. Cross-Model Generalization: The approach has been validated across different reasoners and tasks.

Section 05

Technical Innovations and Summary: Core Value of ACTS

Technical Insights

  1. Control Upgrade: Control shifts from "controlling output" to "controlling strategy", which makes reasoning more transparent and adjustable;
  2. Collaboration Paradigm: The dual-agent division of labor (the reasoner supplies base capabilities, the controller handles strategy) suggests a new pattern for LLM system design;
  3. Budget Awareness: The resource budget is built into decision-making, so the system can adapt to resource-constrained scenarios.

Summary

ACTS achieves efficient and controllable LLM reasoning through MDP modeling and a dual-agent architecture. It saves tokens while maintaining quality and supports a flexible accuracy-efficiency trade-off, which gives it both theoretical and practical value.

Section 06

Application Scenarios: Applicable Fields and Prospects of ACTS

ACTS is suited to the following scenarios:

  1. Cost-sensitive production environments: Commercial applications that must balance reasoning quality against API call costs;
  2. Real-time interaction systems: Chatbots and real-time assistants that need fast responses;
  3. Multi-level reasoning tasks: Complex tasks in which the reasoning strategy is adjusted dynamically for each subtask.

This framework provides a feasible path for fine-grained control of LLM reasoning and is expected to see use in more practical scenarios in the future.