Section 01
[Main Floor] Study on the Interpretability of Implicit Reasoning Models: Core Findings That Challenge the Conventional View
This empirical study challenges the conventional view that implicit (latent) reasoning models (LRMs) are uninterpretable. Key findings:

1) The implicit reasoning tokens of LRMs are often unnecessary: removing them still yields the same answers (see the ablation sketch below).
2) The implicit tokens can be decoded into human-understandable reasoning traces, with 65-93% decoding accuracy on correctly answered samples (a decoding sketch follows the list).
3) Interpretability can serve as a signal of prediction correctness: correct predictions are easy to decode, while incorrect ones are hard.

These findings offer a new perspective for evaluating the interpretability and reliability of LRMs.
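Finding 1 is essentially an ablation: answer each question twice, once with the latent reasoning budget and once with it removed, and measure how often the answers agree. Below is a minimal sketch in Python, where `answer(question, num_latent_steps)` is a hypothetical wrapper around an LRM's generation call, not an API from the study:

```python
# Hypothetical ablation harness for finding 1. `answer` stands in for a
# wrapper around the model's generation call; num_latent_steps=0 means the
# implicit reasoning tokens are skipped entirely. All names and the toy
# model below are illustrative assumptions, not the study's actual code.

def ablation_agreement(answer, questions, num_latent_steps=6):
    """Fraction of questions whose final answer is unchanged when the
    implicit reasoning tokens are removed."""
    same = 0
    for q in questions:
        with_latents = answer(q, num_latent_steps=num_latent_steps)
        without_latents = answer(q, num_latent_steps=0)
        same += with_latents == without_latents
    return same / len(questions)

# Toy stand-in "model" that ignores its latent budget, so agreement is 1.0.
toy = lambda q, num_latent_steps: q[::-1]
print(ablation_agreement(toy, ["12+7=", "3*9="]))  # -> 1.0
```

A high agreement rate under this ablation is what licenses the claim that the latent tokens are often not causally necessary for the final answer.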
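Findings 2 and 3 hinge on decoding latent tokens back into vocabulary space. One common way to do this, assumed here in the spirit of the logit lens rather than taken from the study, is to project each latent vector through the model's unembedding matrix; the per-step decode confidence then doubles as the correctness signal of finding 3. All names (`latent_tokens`, `W_unembed`) and the 0.5 threshold are illustrative placeholders:

```python
import torch

def decode_latents(latent_tokens: torch.Tensor,
                   W_unembed: torch.Tensor):
    """Project latent reasoning vectors onto the vocabulary.

    latent_tokens: (num_latents, hidden_size) continuous reasoning states.
    W_unembed:     (vocab_size, hidden_size) output embedding matrix.
    Returns argmax token ids and their softmax probabilities, a crude
    per-step "decodability" score.
    """
    logits = latent_tokens @ W_unembed.T        # (num_latents, vocab_size)
    probs = torch.softmax(logits, dim=-1)
    conf, token_ids = probs.max(dim=-1)         # per-step decode confidence
    return token_ids, conf

# Toy usage: random stand-ins for a real model's latents and unembedding.
hidden, vocab, steps = 64, 1000, 6
latents = torch.randn(steps, hidden)
W = torch.randn(vocab, hidden)
ids, conf = decode_latents(latents, W)

# Finding 3 as a heuristic: flag a prediction as suspect when the latent
# trace decodes with low average confidence. The 0.5 threshold is an
# illustrative assumption, not a value reported by the study.
likely_correct = conf.mean().item() > 0.5
print(ids.tolist(), round(conf.mean().item(), 3), likely_correct)
```

On this view, interpretability is not just a post-hoc diagnostic: the same decode-confidence score can be thresholded at inference time as a cheap reliability check.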