Zing Forum

Does Reasoning Ability Come at the Cost of Alignment? The Trustworthiness Crisis of Large Reasoning Models

A recent study finds that converting instruction-tuned models into reasoning models often degrades alignment: toxicity increases, biases are amplified, and privacy leaks become more likely. The authors call for trustworthiness metrics to be included when reasoning models are evaluated.

Tags: reasoning models, AI safety, alignment, trustworthiness, bias, privacy protection, model evaluation
Published 2026-06-10 00:14 | Recent activity 2026-06-10 10:52 | Estimated read 6 min

Section 01

[Introduction] The Trustworthiness Crisis of Reasoning Models: Does Ability Improvement Sacrifice Alignment?

This post is based on the study "Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models", published on arXiv on June 9, 2026. Key finding: converting instruction-tuned models into reasoning models degrades alignment (increased toxicity, amplified biases, privacy leaks, and more), so trustworthiness metrics should be part of how reasoning models are evaluated. The sections below cover the background, findings, causes, and countermeasures in turn.


Section 02

[Background] Hidden Alignment Concerns Behind the Boom of Reasoning Models

Since 2024, large reasoning models (LRMs) such as DeepSeek-R1 and OpenAI o1 have demonstrated strong reasoning capabilities through multi-step chain-of-thought, sparking an AI boom. One key question has been overlooked, however: when a model is optimized for reasoning, does it keep the safety alignment properties (safe refusal, bias avoidance, privacy protection) it acquired during instruction tuning? These properties are the cornerstones of model trustworthiness. If they degrade, a more capable model is also a riskier one.


Section 03

[Key Finding] Reasoning Model Conversion Does Not Preserve Alignment by Default

Through a systematic trustworthiness audit, the study concludes that converting a model into a reasoning model does not preserve alignment by default. It compares three post-training methods (supervised fine-tuning (SFT), RL post-training, and knowledge distillation), and in every case improved reasoning ability comes with some degree of alignment degradation. The degradation is a systematic behavioral drift: a KL-divergence check shows that the converted models differ significantly from their original baselines.
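To make the KL-divergence check concrete, here is a minimal sketch of how drift between a baseline and a converted model could be measured on a single prompt. The three-way response distribution (refuse, hedge, comply), the numbers, and the direction KL(baseline || reasoning) are all illustrative assumptions; the post does not say how the paper sets up its measurement.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same outcomes."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            # Clamp q away from zero so the logarithm stays finite.
            total += pi * math.log(pi / max(qi, eps))
    return total


# Hypothetical probabilities of (refuse, hedge, comply) on one harmful prompt.
baseline = [0.80, 0.15, 0.05]   # instruction-tuned model (assumed values)
reasoning = [0.40, 0.20, 0.40]  # converted reasoning model (assumed values)

drift = kl_divergence(baseline, reasoning)
print(f"KL(baseline || reasoning) = {drift:.4f} nats")
```

A value of zero would mean the two models behave identically on that prompt; averaging the value over a prompt set gives one possible drift score.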


Section 04

[Evidence] Six Dimensions Reveal Trustworthiness Issues

The paper evaluates the trustworthiness of reasoning models from six dimensions:

  1. Safety: Miscalibrated refusal behavior (over-refusing legitimate requests or failing to refuse harmful ones);
  2. Toxicity: Increased toxicity level of generated content;
  3. Bias: Amplified stereotypes (reinforcing biased assumptions during reasoning);
  4. Machine Ethics: Over-complication of moral reasoning leading to deviation from principles;
  5. Privacy: Contextual privacy leaks (exposing sensitive information or inferring user privacy);
  6. OOD Robustness: Unstable alignment behavior under out-of-distribution inputs.
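As an illustration of the first dimension, refusal miscalibration can be split into two rates: how often benign prompts are refused, and how often harmful prompts are answered. The sketch below assumes each prompt already carries a ground-truth `harmful` label and a `refused` judgment; the `Outcome` type and the sample data are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    harmful: bool  # ground-truth label of the prompt (assumed to be available)
    refused: bool  # whether the model refused to answer


def refusal_rates(outcomes):
    """Return the over-refusal rate on benign prompts and the miss rate on harmful ones."""
    benign = [o for o in outcomes if not o.harmful]
    harmful = [o for o in outcomes if o.harmful]
    over = sum(o.refused for o in benign) / len(benign) if benign else 0.0
    missed = sum(not o.refused for o in harmful) / len(harmful) if harmful else 0.0
    return {"over_refusal_rate": over, "missed_harm_rate": missed}


# Hypothetical audit results for six prompts.
sample = [
    Outcome(harmful=False, refused=False),
    Outcome(harmful=False, refused=True),   # over-refusal
    Outcome(harmful=True, refused=True),
    Outcome(harmful=True, refused=False),   # missed harmful request
    Outcome(harmful=True, refused=False),   # missed harmful request
    Outcome(harmful=False, refused=False),
]

rates = refusal_rates(sample)
print(rates)
```

Reporting the two rates separately matters because a model can look safe on one while failing badly on the other.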

Section 05

[Causes] Deep-seated Factors of Alignment Degradation

Causes of degradation include:

  • Single optimization objective: Focusing only on reasoning accuracy without alignment constraints;
  • Training data bias: Reasoning data contains unfiltered biased/toxic content;
  • Reasoning process risks: Multi-step reasoning provides more opportunities to reinforce biases;
  • Reward model limitations: Reward models in RL training cannot fully capture alignment details.

Section 06

[Recommendations] Industry Strategies to Address the Trustworthiness Crisis

The study proposes improvement directions:

  1. Improve evaluation systems: Include trustworthiness metrics in reasoning model evaluations;
  2. Multi-objective optimization: Adopt multi-objective frameworks in post-training to balance reasoning ability and alignment;
  3. Make alignment auditing routine: Run trustworthiness audits at every stage of development;
  4. Strengthen red team testing: Design specialized test cases for reasoning scenarios;
  5. Transparent disclosure: Proactively publish trustworthiness evaluation results.
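The first two recommendations can be sketched in a few lines: a scalarized reward that weighs reasoning quality against the weakest alignment dimension, and a check that flags dimensions where a converted model falls below its baseline. The function names, the weight, the tolerance, and every score below are assumptions for illustration only; the study as summarized here does not prescribe a specific formula.

```python
def combined_reward(reasoning_score, alignment_scores, weight=0.5):
    """Blend a reasoning score with the worst alignment score (all values in [0, 1])."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    worst = min(alignment_scores.values())
    return (1.0 - weight) * reasoning_score + weight * worst


def alignment_regressions(baseline, candidate, tolerance=0.02):
    """List dimensions where the candidate scores more than `tolerance` below baseline."""
    return sorted(
        dim
        for dim, base in baseline.items()
        if base - candidate.get(dim, 0.0) > tolerance
    )


# Hypothetical trustworthiness scores (higher is better).
baseline_scores = {"safety": 0.92, "toxicity": 0.95, "bias": 0.90, "privacy": 0.93}
candidate_scores = {"safety": 0.91, "toxicity": 0.85, "bias": 0.89, "privacy": 0.80}

reward = combined_reward(0.90, candidate_scores, weight=0.5)
flagged = alignment_regressions(baseline_scores, candidate_scores)
print(f"combined reward: {reward:.2f}, regressed dimensions: {flagged}")
```

Using the minimum over dimensions is one design choice among several: it prevents a strong score in one area from hiding a collapse in another, at the cost of ignoring gains everywhere else.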

Section 07

[Reflection & Conclusion] The Path to Balancing Ability and Safety

Philosophical reflection: Does stronger AI ability necessarily bring greater risk? On the current technical path, the answer tends to be yes. Improved reasoning ability comes with changes in values and behavior patterns, so the balance between technical progress and social responsibility has to be built into every stage of development. Conclusion: Reasoning models sit at the forefront of AI and, for the same reason, at the forefront of risk. The community and industry need to work together to keep them trustworthy. The race for reasoning ability must not be won at the cost of alignment.