Zing Forum

COMPASS: A Cognitive MCTS-Guided Process Alignment Framework for Safe Search Agents

COMPASS is a safety alignment framework for search agents. It targets the retrieval-induced safety issues that arise in multi-step interactions, combining cognitive tree exploration with introspective step-by-step alignment.

Tags: AI safety, search agents, MCTS, process alignment, adversarial attacks, safety alignment, multi-step reasoning, tool use
Published 2026-05-29 12:51 | Recent activity 2026-06-01 12:51 | Estimated read 4 min

Section 01

Introduction: COMPASS—A Cognitive MCTS-Guided Process Alignment Framework for Safe Search Agents

COMPASS is a process alignment framework for safe search agents, designed to address retrieval-induced safety issues in multi-step interactions. It rests on two pillars: Cognitive Tree Exploration (CTE), which proactively discovers hidden attack trajectories, and Introspective Step-by-Step Alignment (ISA), which localizes risk at a fine granularity. Together they aim to balance safety and utility.


Section 02

Background: New Retrieval-Induced Safety Challenges Faced by Search Agents

Search agents driven by large language models have capabilities such as multi-step reasoning and tool calling, but they also face the risk of retrieval-induced safety degradation: a harmful intention can be decomposed into a combination of seemingly harmless sub-queries, each of which passes a traditional safety check on its own. Existing alignment methods struggle to capture such sparse safety signals and cannot effectively supervise the diverse violations that occur across multi-step interactions, which is why new solutions are needed.
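
To make the decomposition problem concrete, here is a toy sketch. It is not taken from COMPASS: the keyword set, the queries, and both check functions are hypothetical placeholders chosen only to show how a check applied to each query in isolation can pass while the same check applied to the accumulated trajectory fails.

```python
from typing import List

# Hypothetical: a combination of topics that is only risky when all appear together.
RISKY_COMBINATION = {"alpha", "beta", "gamma"}


def query_is_flagged(query: str) -> bool:
    """Per-query check: flags a query only if it alone covers the whole combination."""
    return RISKY_COMBINATION <= set(query.lower().split())


def trajectory_is_flagged(queries: List[str]) -> bool:
    """Trajectory-level check: flags once the accumulated queries cover the combination."""
    seen = set()
    for query in queries:
        seen |= set(query.lower().split())
    return RISKY_COMBINATION <= seen


# Hypothetical sub-queries, each harmless-looking in isolation.
queries = ["what is alpha", "where to find beta", "how to combine gamma"]

per_query_flags = [query_is_flagged(q) for q in queries]
whole_trajectory_flag = trajectory_is_flagged(queries)
```

In this toy setup no individual query is flagged, yet the trajectory as a whole is. A real system would need a far richer risk signal than keyword overlap; the point is only that the unit of supervision matters.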


Section 03

Methodology: Dual-Pillar Design of the COMPASS Framework

The COMPASS framework consists of two core modules:

  1. Cognitive Tree Exploration (CTE): Drawing on Monte Carlo Tree Search (MCTS), it proactively explores and synthesizes hidden attack trajectories, discovering complex attack patterns that traditional methods struggle to identify;
  2. Introspective Step-by-Step Alignment (ISA): It performs fine-grained analysis of each step in the interaction sequence, locates "dangerous intermediate actions", and achieves precise safety intervention while minimizing impact on normal functions.
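
The summary does not give CTE's actual algorithm, so the following is only a generic MCTS (UCT) sketch of the idea it draws on: searching a tree of action sequences for trajectories that a risk judge scores highly. The action set, the `TARGET` sequence, and the `risk` function are all hypothetical stand-ins; a real system would presumably replace them with agent actions and a learned or model-based judge.

```python
import math
import random
from typing import Dict, Optional, Tuple

ACTIONS = ["a", "b", "c"]      # hypothetical abstract sub-queries
MAX_DEPTH = 3                  # hypothetical trajectory length
TARGET = ("a", "b", "c")       # hypothetical "hidden attack" sequence


def risk(trajectory: Tuple[str, ...]) -> float:
    """Stand-in risk judge: fraction of TARGET matched as a prefix."""
    matched = 0
    for got, want in zip(trajectory, TARGET):
        if got != want:
            break
        matched += 1
    return matched / len(TARGET)


class Node:
    def __init__(self, trajectory: Tuple[str, ...], parent: Optional["Node"] = None):
        self.trajectory = trajectory
        self.parent = parent
        self.children: Dict[str, "Node"] = {}
        self.visits = 0
        self.value = 0.0

    def is_terminal(self) -> bool:
        return len(self.trajectory) >= MAX_DEPTH

    def fully_expanded(self) -> bool:
        return len(self.children) == len(ACTIONS)


def uct_child(node: Node, c: float = 1.4) -> Node:
    """Pick the child maximizing mean reward plus an exploration bonus."""
    return max(
        node.children.values(),
        key=lambda ch: ch.value / ch.visits
        + c * math.sqrt(math.log(node.visits) / ch.visits),
    )


def search(iterations: int = 300, seed: int = 0) -> Tuple[str, ...]:
    rng = random.Random(seed)
    root = Node(())
    for _ in range(iterations):
        node = root
        # Selection: descend while the node is fully expanded.
        while not node.is_terminal() and node.fully_expanded():
            node = uct_child(node)
        # Expansion: add one untried action.
        if not node.is_terminal():
            action = rng.choice([a for a in ACTIONS if a not in node.children])
            node.children[action] = Node(node.trajectory + (action,), node)
            node = node.children[action]
        # Simulation: random rollout to full depth, scored by the risk judge.
        rollout = node.trajectory
        while len(rollout) < MAX_DEPTH:
            rollout += (rng.choice(ACTIONS),)
        reward = risk(rollout)
        # Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Read out the most-visited path as the discovered trajectory.
    node = root
    while node.children:
        node = max(node.children.values(), key=lambda ch: ch.visits)
    return node.trajectory


discovered = search()
```

With these toy settings the search concentrates its visits on the high-risk branch, so the most-visited path is the one the judge scores highest. Trajectories found this way could then serve as training material for step-level alignment, though how COMPASS uses them in detail is not stated here.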

Section 04

Evidence: Advantages of COMPASS in Safety-Utility Trade-off

Experimental results show that COMPASS maintains high safety with minimal impact on the agent's general utility. Compared to existing methods, it also requires significantly less training data, which makes it better suited to practical deployment.


Section 05

Conclusion: Implications of COMPASS for AI Safety

The research on COMPASS brings key implications for AI safety:

  • Process supervision is more suitable for safety alignment of complex agents than outcome supervision;
  • Proactive attack discovery (red team thinking) should become a standard practice in AI safety development;
  • Fine-grained supervision enhances system transparency and auditability, promoting the development of explainable AI safety.
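
The contrast between outcome and process supervision in the first point can be sketched as follows. This is an illustration under stated assumptions, not ISA itself: the `Step` type, the per-step risk scores, and `RISK_THRESHOLD` are hypothetical, and the scores are hand-written rather than produced by any model.

```python
from dataclasses import dataclass
from typing import List, Optional

RISK_THRESHOLD = 0.5  # hypothetical cutoff for calling a step unsafe


@dataclass
class Step:
    action: str
    risk: float  # hypothetical per-step risk score in [0, 1]


def outcome_label(steps: List[Step]) -> bool:
    """Outcome supervision: one unsafe/safe label for the whole trajectory."""
    return any(step.risk >= RISK_THRESHOLD for step in steps)


def first_unsafe_step(steps: List[Step]) -> Optional[int]:
    """Process supervision: index of the first step that crosses the threshold."""
    for index, step in enumerate(steps):
        if step.risk >= RISK_THRESHOLD:
            return index
    return None


# Hypothetical trajectory with hand-assigned risk scores.
trajectory = [
    Step("search: general background", 0.1),
    Step("search: narrow follow-up", 0.3),
    Step("open: retrieved page with unsafe content", 0.8),
    Step("answer: summarize findings", 0.6),
]

unsafe_overall = outcome_label(trajectory)
unsafe_index = first_unsafe_step(trajectory)
```

The outcome label only says that something went wrong, while the step index says where. That location is what allows an intervention at the dangerous intermediate action instead of rejecting the whole interaction, and it also gives an auditor a specific step to inspect.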

Section 06

Future Directions: Limitations and Expansion Paths of COMPASS

COMPASS still has room for expansion:

  1. Extend the framework to multi-modal scenarios (e.g., images, code execution);
  2. Improve the granularity of ISA's risk localization (e.g., down to specific tokens or reasoning steps);
  3. Further reduce training data requirements, moving toward zero-shot or few-shot safety alignment.