Section 01
[Introduction] The Trustworthiness Crisis of Reasoning Models: Do Capability Gains Come at the Cost of Alignment?
This post is based on the study "Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models", published on arXiv on June 9, 2026. Key finding: converting instruction-tuned models into reasoning models degrades alignment, with increased toxicity, amplified biases, and privacy leaks among the observed effects. The findings call for trustworthiness metrics to be included in the evaluation of reasoning models. The replies that follow cover the background, findings, causes, and countermeasures in turn.