# Monitoring Inner Monologue: Probing Trajectories Reveal the Dynamic Behavior of Reasoning Models

> This article introduces a method for constructing probing trajectories by evaluating detectors at every generated token. It finds that a model's future behavior is easier to distinguish from the complete reasoning trajectory than from a single static prediction, and that max-pooling over the trajectory can reach an AUROC of 95%.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-18T15:29:04.000Z
- Last activity: 2026-05-19T03:32:08.489Z
- Popularity: 134.9
- Keywords: reasoning models, safety monitoring, chain-of-thought, internal representations, probing trajectories, AI safety
- Page link: https://www.zingnex.cn/en/forum/thread/llm-arxiv-2605-18549v1
- Canonical: https://www.zingnex.cn/forum/thread/llm-arxiv-2605-18549v1

---

## [Introduction] Monitoring Inner Monologue: Probing Trajectories Reveal the Dynamic Behavior of Reasoning Models

This article introduces a study published in May 2026. To address the unreliability of Chain-of-Thought (CoT) in Large Reasoning Models (LRMs), the study proposes probing trajectories: detectors are evaluated on the model's internal hidden representations at each generated token position, and their outputs are arranged into a trajectory. It finds that the complete trajectory distinguishes future behavior more easily than a single static prediction. Max-pooling over the trajectory can reach an AUROC of 95%, offering a new perspective on LRM safety monitoring.

## Background: Unreliability of CoT as a Safety Monitoring Tool

The value of CoT for safety monitoring rests on the assumption that "the thinking process faithfully reflects the final decision". Three issues undermine that assumption:

1. Unfaithful CoT: the reasoning steps are logically inconsistent with the final output.
2. Strategic CoT: the model generates plausible-looking reasoning while the actual decision-making process differs.
3. Unverifiable CoT: it is difficult to confirm whether the CoT truly reflects internal reasoning.

These issues weaken the reliability of CoT and motivate alternative monitoring methods.

## Method: Construction and Feature Extraction of Probing Trajectories

**Construction of Probing Trajectories**:

1. Evaluate trained detectors at each generated token position.
2. Arrange the concept probabilities output by the detectors in token order to form a continuous trajectory.
3. Analyze the dynamic features of the trajectory.

**Feature Extraction**:

- Volatility: the magnitude of probability changes.
- Trend: the overall direction of change.
- Steady-state behavior: the degree of stability in the later stage of reasoning.

These features improve the separability of future model states.
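The construction steps and the three features above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the detector is assumed to be a linear probe followed by a sigmoid, the hidden states are assumed to arrive as one vector per generated token, and the specific feature formulas (mean absolute step, least-squares slope, standard deviation of the final quarter) are hypothetical choices made for this example.

```python
import math
from statistics import mean, pstdev


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def probe_trajectory(hidden_states, weights, bias):
    """Evaluate one linear detector on the hidden state of every generated token.

    hidden_states: list of per-token vectors (assumed input format).
    Returns one concept probability per token, in generation order.
    """
    return [
        sigmoid(sum(w * h for w, h in zip(weights, state)) + bias)
        for state in hidden_states
    ]


def trajectory_features(traj, tail_fraction=0.25):
    """Summarize a non-empty trajectory with three illustrative features."""
    n = len(traj)

    # Volatility: mean absolute change between consecutive tokens.
    steps = [b - a for a, b in zip(traj, traj[1:])]
    volatility = mean([abs(s) for s in steps]) if steps else 0.0

    # Trend: least-squares slope of probability against token index.
    t_mean = (n - 1) / 2
    p_mean = mean(traj)
    denom = sum((t - t_mean) ** 2 for t in range(n))
    trend = (
        sum((t - t_mean) * (p - p_mean) for t, p in enumerate(traj)) / denom
        if denom
        else 0.0
    )

    # Steady-state behavior: spread of the final part of the trajectory
    # (a lower value means a more stable late stage).
    tail = traj[-max(1, int(n * tail_fraction)):]
    steady_state_std = pstdev(tail)

    return {
        "volatility": volatility,
        "trend": trend,
        "steady_state_std": steady_state_std,
    }


if __name__ == "__main__":
    # Toy 2-dimensional hidden states, invented purely for illustration.
    toy_states = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [2.1, 0.0]]
    toy_traj = probe_trajectory(toy_states, weights=[1.0, 0.0], bias=0.0)
    print(toy_traj)
    print(trajectory_features(toy_traj))
```

In practice the hidden states would come from a chosen layer of the reasoning model and the probe weights from a trained detector; both are stand-ins here.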

## Key Findings: Trajectory Advantages and Methodological Breakthroughs

1. Trajectories outperform static predictions: complete probing trajectories distinguish future behaviors more easily than a prediction from a single position (e.g., the last token).
2. Template training data is effective: performance is close to that obtained with dynamically generated responses, which reduces cost and improves reproducibility.
3. Pooling plays a key role: max-pooling achieves an AUROC of 95%, while average and last-token pooling are close to chance level.
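To make the pooling comparison concrete, the sketch below reduces each trajectory to a single score with max, mean, or last-token pooling and then computes AUROC from those scores. It is a hedged illustration only: the `pool` and `auroc` helpers are hypothetical names, AUROC is computed with the standard pairwise (Mann-Whitney) definition, and the toy trajectories are invented to exercise the functions, not to reproduce the reported numbers.

```python
from statistics import mean


def pool(traj, how):
    """Collapse a per-token probability trajectory into one score."""
    if how == "max":
        return max(traj)
    if how == "mean":
        return mean(traj)
    if how == "last":
        return traj[-1]
    raise ValueError(f"unknown pooling: {how}")


def auroc(scores, labels):
    """AUROC as the probability that a positive outscores a negative.

    Ties count as half a win. Labels are 1 (positive) or 0 (negative).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Invented toy data: positives contain a brief spike, negatives stay low.
    trajectories = [
        [0.1, 0.9, 0.1, 0.1],
        [0.2, 0.1, 0.8, 0.3],
        [0.2, 0.3, 0.2, 0.2],
        [0.1, 0.2, 0.3, 0.2],
    ]
    labels = [1, 1, 0, 0]
    for how in ("max", "mean", "last"):
        scores = [pool(t, how) for t in trajectories]
        print(how, auroc(scores, labels))
```

The intuition this is meant to convey: a transient spike in the detector output survives max-pooling but can be diluted by averaging or missed entirely if only the last token is read.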

## Experimental Evidence: Validation in Safety and Mathematics Domains

Experiments were conducted on 4 datasets and 4 reasoning models:

- Safety domain: harmful outputs can be predicted in advance (95% AUROC for early warning).
- Mathematics domain: the correctness of answers can be predicted.
- Task specificity: probing trajectories encode task-specific dynamics, with different patterns in safety versus mathematics.
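One way to picture the early-warning use is a streaming monitor that tracks the running maximum of the detector probability and raises an alert as soon as it crosses a threshold. The sketch below is an assumption-laden illustration: `first_alert_index` is a hypothetical helper, and the threshold of 0.8 is an arbitrary placeholder rather than a value from the study.

```python
from typing import Iterable, Optional


def first_alert_index(
    probabilities: Iterable[float], threshold: float = 0.8
) -> Optional[int]:
    """Return the token index at which the running max first reaches the threshold.

    Returns None if the threshold is never reached. The input can be any
    iterable, so probabilities may be consumed as tokens are generated.
    """
    running_max = float("-inf")
    for index, p in enumerate(probabilities):
        running_max = max(running_max, p)
        if running_max >= threshold:
            return index
    return None


if __name__ == "__main__":
    # Invented detector outputs for a single generation.
    stream = [0.10, 0.15, 0.40, 0.85, 0.30]
    print(first_alert_index(stream))
```

Because the running maximum never decreases, this kind of max-pooled score can be evaluated online: an alert can fire before generation finishes, without waiting for the full trajectory.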

## Application Prospects and Limitations

**Application Prospects**: safety monitoring (early warning of harmful outputs), reasoning validation (predicting answer correctness), model debugging (understanding reasoning processes), and human-machine collaboration (providing confidence signals).

**Limitations**: detectors need to generalize better across domains and models; real-time monitoring may add inference latency; adversarial robustness requires further research.

## Conclusion and References

By monitoring the model's internal representations, the probing trajectory method captures how the reasoning process evolves over time and predicts future behaviors with up to 95% AUROC when max-pooling is used, making it a promising tool for LRM safety monitoring. Reference: http://arxiv.org/abs/2605.18549v1, published on May 18, 2026.