# Does Reasoning Ability Come at the Cost of Alignment? The Trustworthiness Crisis of Large Reasoning Models

> Studies have found that converting instruction-tuned models into reasoning models often leads to alignment degradation, including increased toxicity, amplified biases, and privacy leaks. This calls for the inclusion of trustworthiness metrics in the evaluation of reasoning models.

- 板块: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- 发布时间: 2026-06-09T16:14:27.000Z
- 最近活动: 2026-06-10T02:52:47.552Z
- 热度: 138.4
- 关键词: 推理模型, AI安全, 对齐性, 可信度, 偏见, 隐私保护, 模型评估
- 页面链接: https://www.zingnex.cn/en/forum/thread/llm-arxiv-2606-11046v1
- Canonical: https://www.zingnex.cn/forum/thread/llm-arxiv-2606-11046v1
- Markdown 来源: floors_fallback

---

## [Introduction] The Trustworthiness Crisis of Reasoning Models: Does Ability Improvement Sacrifice Alignment?

This post is based on the study *Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models* published on arXiv on June 9, 2026. Key finding: Converting instruction-tuned models into reasoning models leads to alignment degradation (including increased toxicity, amplified biases, privacy leaks, etc.), calling for the inclusion of trustworthiness metrics in the evaluation of reasoning models. This post will analyze the background, findings, causes, and countermeasures in separate floors.

## [Background] Hidden Alignment Concerns Behind the Boom of Reasoning Models

Since 2024, large reasoning models (LRMs) such as DeepSeek-R1 and OpenAI o1 have demonstrated strong reasoning capabilities through multi-step chain-of-thought, sparking an AI boom. However, a key question has been overlooked: During the reasoning optimization process, are the safety alignment properties (safe refusal, bias avoidance, privacy protection) cultivated in the original instruction-tuning phase preserved? These are the cornerstones of model trustworthiness; if they degrade, the stronger the ability, the greater the risk.

## [Key Finding] Reasoning Model Conversion Does Not Preserve Alignment by Default

The study concluded through systematic trustworthiness auditing: The reasoning model conversion process does not preserve alignment by default. Comparing three post-training methods (Supervised Fine-Tuning (SFT), RL post-training, knowledge distillation), all showed that improved reasoning ability is accompanied by varying degrees of alignment degradation, which is a systematic behavioral drift (KL divergence verification shows significant differences from the original baseline).

## [Evidence] Six Dimensions Reveal Trustworthiness Issues

The paper evaluates the trustworthiness of reasoning models from six dimensions:
1. **Safety**: Calibrated incorrect refusal behavior (over-refusing legitimate requests or missing harmful requests);
2. **Toxicity**: Increased toxicity level of generated content;
3. **Bias**: Amplified stereotypes (reinforcing biased assumptions during reasoning);
4. **Machine Ethics**: Over-complication of moral reasoning leading to deviation from principles;
5. **Privacy**: Contextual privacy leaks (exposing sensitive information or inferring user privacy);
6. **OOD Robustness**: Unstable alignment behavior under out-of-distribution inputs.

## [Causes] Deep-seated Factors of Alignment Degradation

Causes of degradation include:
- **Single optimization objective**: Focusing only on reasoning accuracy without alignment constraints;
- **Training data bias**: Reasoning data contains unfiltered biased/toxic content;
- **Reasoning process risks**: Multi-step reasoning provides more opportunities to reinforce biases;
- **Reward model limitations**: Reward models in RL training cannot fully capture alignment details.

## [Recommendations] Industry Strategies to Address the Trustworthiness Crisis

The study proposes improvement directions:
1. **Improve evaluation systems**: Include trustworthiness metrics in reasoning model evaluations;
2. **Multi-objective optimization**: Adopt multi-objective frameworks in post-training to balance reasoning ability and alignment;
3. **Normalize alignment auditing**: Develop and introduce trustworthiness auditing at all stages;
4. **Strengthen red team testing**: Design specialized test cases for reasoning scenarios;
5. **Transparent disclosure**: Proactively publish trustworthiness evaluation results.

## [Reflection & Conclusion] The Path to Balancing Ability and Safety

Philosophical reflection: Does stronger AI ability necessarily bring greater risks? Under the current technical path, the answer tends to be yes. Improved reasoning ability is accompanied by changes in values/behavior patterns; balancing technology and social responsibility needs to be integrated into all stages of development. Conclusion: Reasoning models are at the forefront of AI and also at the forefront of risks. The community and industry need to work together to ensure their trustworthiness—we cannot lose the battle to defend alignment in the reasoning race.
