Zing Forum


Monitoring Inner Monologue: Probing Trajectories Reveal the Dynamic Behavior of Reasoning Models

This article introduces a method that builds probing trajectories by evaluating detectors at every generated token. The study finds that a model's future behavior is easier to distinguish from the complete trajectory than from a single static prediction, and that max-pooling over the trajectory reaches an AUROC of 95%.

Reasoning models · Safety monitoring · Chain-of-thought · Internal representations · Probing trajectories · AI safety
Published 2026-05-18 23:29 | Recent activity 2026-05-19 11:32 | Estimated read 6 min

Section 01

[Introduction] Monitoring Inner Monologue: Probing Trajectories Reveal the Dynamic Behavior of Reasoning Models

This article introduces a study published in May 2026. To address the unreliability of Chain-of-Thought (CoT) in Large Reasoning Models (LRMs), the study proposes the probing trajectory method: it monitors the model's internal hidden representations and evaluates detectors at each generated token position to build a trajectory. It finds that the complete trajectory distinguishes future behaviors more easily than a single static prediction. With max-pooling over the trajectory, the method reaches an AUROC of 95%, offering a new perspective on LRM safety monitoring.


Section 02

Background: Unreliability of CoT as a Safety Monitoring Tool

The value of CoT for safety monitoring rests on the assumption that "the thinking process faithfully reflects the final decision", but this assumption faces three major issues: 1. Unfaithful CoT: the reasoning steps are logically inconsistent with the final output; 2. Strategic CoT: the model generates plausible-looking reasoning while its actual decision-making process differs; 3. Unverifiable CoT: it is difficult to confirm whether the CoT truly reflects the model's internal reasoning. These issues weaken the reliability of CoT and motivate alternative monitoring methods.


Section 03

Method: Construction and Feature Extraction of Probing Trajectories

Construction of Probing Trajectories: 1. Evaluate trained detectors at each generated token position; 2. Arrange the concept probabilities output by the detectors in token order to form a trajectory; 3. Analyze the dynamic features of the trajectory. Feature Extraction: volatility (the magnitude of probability changes), trend (the overall direction of change), and steady-state behavior (how stable the probability is in the later stage of reasoning). These features improve the separability of the model's future states.
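The construction and feature-extraction steps above can be sketched as follows. This is a minimal illustration, not the study's implementation: the linear-sigmoid detector, the function names, the `tail_frac` parameter, and the exact definitions of volatility (mean absolute step), trend (least-squares slope), and steady-state behavior (standard deviation over the final portion) are all assumptions chosen for clarity.

```python
import math
from statistics import mean, pstdev


def probe_probability(hidden, weights, bias):
    """Hypothetical linear detector: sigmoid(w . h + b) for one token's hidden state."""
    logit = sum(w * h for w, h in zip(weights, hidden)) + bias
    # Numerically stable sigmoid (avoids overflow for large negative logits).
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def probing_trajectory(hidden_states, weights, bias):
    """Steps 1-2: evaluate the detector at every token, keeping token order."""
    return [probe_probability(h, weights, bias) for h in hidden_states]


def trajectory_features(traj, tail_frac=0.25):
    """Step 3: summarize one trajectory (feature definitions are assumptions)."""
    if not traj:
        raise ValueError("trajectory must contain at least one probability")
    n = len(traj)
    # Volatility: mean absolute change between consecutive token positions.
    steps = [b - a for a, b in zip(traj, traj[1:])]
    volatility = mean(abs(s) for s in steps) if steps else 0.0
    # Trend: least-squares slope of probability against token index.
    if n > 1:
        t_mean = (n - 1) / 2
        p_mean = mean(traj)
        num = sum((t - t_mean) * (p - p_mean) for t, p in enumerate(traj))
        den = sum((t - t_mean) ** 2 for t in range(n))
        trend = num / den
    else:
        trend = 0.0
    # Steady-state behavior: spread of the last tail_frac of the trajectory.
    tail = traj[-max(1, math.ceil(tail_frac * n)):]
    return {
        "volatility": volatility,
        "trend": trend,
        "steady_state_std": pstdev(tail),
    }
```

In practice the hidden states would come from a chosen layer of the reasoning model and the detector weights from a separately trained probe; both are outside the scope of this sketch.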


Section 04

Key Findings: Trajectory Advantages and Methodological Breakthroughs

1. Trajectories outperform static predictions: complete probing trajectories distinguish future behaviors more easily than a prediction from a single position (e.g., the last token); 2. Efficacy of template training data: detectors trained on template data perform close to those trained on dynamically generated responses, which reduces cost and improves reproducibility; 3. Key role of pooling operations: max-pooling achieves an AUROC of 95%, while average and last-token pooling are close to chance level.
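To make the role of pooling concrete, the sketch below reduces each trajectory to a single score with max, mean, or last-token pooling and computes AUROC from those scores. The trajectories are invented toy data, constructed so that the positive class shows only a brief spike; they illustrate the mechanism and do not reproduce the reported numbers. The function names are hypothetical.

```python
def auroc(scores, labels):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Each pooler maps a whole trajectory to one score.
POOLERS = {
    "max": max,
    "mean": lambda traj: sum(traj) / len(traj),
    "last": lambda traj: traj[-1],
}


def pooled_auroc(trajectories, labels, pooler_name):
    """Pool every trajectory with the named pooler, then score with AUROC."""
    pooler = POOLERS[pooler_name]
    return auroc([pooler(t) for t in trajectories], labels)


# Toy data (invented): positives spike briefly, negatives never do,
# and every trajectory ends at the same low probability.
TOY_TRAJECTORIES = [
    [0.1, 0.9, 0.1, 0.1],  # positive: short spike
    [0.1, 0.1, 0.8, 0.1],  # positive: short spike
    [0.4, 0.4, 0.4, 0.1],  # negative: moderate but flat
    [0.1, 0.2, 0.1, 0.1],  # negative: low throughout
]
TOY_LABELS = [1, 1, 0, 0]

if __name__ == "__main__":
    for name in POOLERS:
        print(name, pooled_auroc(TOY_TRAJECTORIES, TOY_LABELS, name))
```

On this constructed data, max-pooling separates the two classes perfectly, while mean and last-token pooling both score 0.5: a transient spike is diluted by averaging and is invisible at the final token. This is one plausible reading of why max-pooling helps, not a claim about the study's data.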

Section 05

Experimental Evidence: Validation in Safety and Mathematics Domains

Experiments were conducted on 4 datasets and 4 reasoning models. In the safety domain, harmful outputs can be predicted in advance (95% AUROC for early warning); in the mathematics domain, the correctness of answers can be predicted. Probing trajectories also encode task-specific dynamics: the safety and mathematics tasks produce different trajectory patterns.


Section 06

Application Prospects and Limitations

Application Prospects: safety monitoring (early warning of harmful outputs), reasoning validation (predicting answer correctness), model debugging (understanding reasoning processes), and human-machine collaboration (providing confidence signals). Limitations: the detectors' ability to generalize across domains and models needs improvement; evaluating detectors at every token in real time may add inference latency; adversarial robustness requires further research.
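For the early-warning use case, max-pooling has a convenient streaming form: a running maximum can be updated as each token is generated, so an alert can fire before generation finishes. The sketch below is a hypothetical monitor, not part of the study; the class name and the threshold value of 0.8 are assumptions, and a real threshold would have to be calibrated on held-out data.

```python
class RunningMaxMonitor:
    """Hypothetical streaming monitor: alerts once the running max crosses a threshold."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed value; calibrate on held-out data
        self.running_max = 0.0
        self.tokens_seen = 0
        self.alert_at = None  # 0-based index of the token that triggered the alert

    def update(self, probability):
        """Consume one detector probability; return True once an alert has fired."""
        index = self.tokens_seen
        self.tokens_seen += 1
        self.running_max = max(self.running_max, probability)
        if self.alert_at is None and self.running_max >= self.threshold:
            self.alert_at = index
        return self.alert_at is not None
```

A caller would invoke `update` once per generated token and could stop or redirect generation as soon as it returns True. Each call costs constant time and memory, though the detector evaluation that produces each probability still adds work per token.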


Section 07

Conclusion and References

The probing trajectory method monitors the model's internal representations to capture how the reasoning process evolves over time. It predicts future behaviors with high accuracy (up to 95% AUROC with max-pooling), which makes it a promising tool for LRM safety monitoring. Reference: http://arxiv.org/abs/2605.18549v1 (arXiv, published May 18, 2026).