Section 01
Introduction: White-Box Diagnosis and Repair Framework for Instruction Hierarchy Failures
This article introduces a white-box diagnostic framework for instruction hierarchy failures in reasoning language models. The framework localizes each failure to one of three stages: instruction recognition, conflict resolution, or response implementation. It also proposes two training-free self-monitoring mechanisms that can reduce violation rates by 81-99%. The research comes from an arXiv paper (published in June 2026) and bears directly on AI safety.