In recent years, large reasoning models (LRMs) like OpenAI o1 and DeepSeek-R1 have achieved remarkable results on complex tasks such as mathematical reasoning and code generation. Their key ingredient is long Chain-of-Thought (CoT) reasoning: the model generates a detailed internal thinking process before giving its final answer.
However, this "deliberative" approach comes with an obvious cost: the reasoning process is extremely lengthy. Models often generate a large number of redundant thinking steps, a phenomenon known as "overthinking." This increases inference latency and computational cost, and it can lead models to deliberate at length even on simple problems, where the extra steps only reduce efficiency.
Most existing solutions use the GRPO (Group Relative Policy Optimization) algorithm to compress output length, but they rely on a static length reward that cannot adapt to problem difficulty or to the distribution of response lengths. As a result, they either over-compress, which reduces accuracy, or under-compress, which limits the efficiency gain.
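To make the limitation concrete, here is a minimal sketch of what a static length reward combined with GRPO-style group-relative advantages might look like. The reward form (correctness minus a fixed per-token penalty), the coefficient `alpha`, the function names, and the sample numbers are all illustrative assumptions, not the formulation of any specific method discussed here.

```python
from statistics import mean, pstdev

def static_length_reward(correct: bool, length: int, alpha: float = 0.001) -> float:
    """Hypothetical static reward: 1 for a correct answer, 0 otherwise,
    minus a fixed penalty per generated token. alpha never changes with
    problem difficulty or with the lengths seen in the group."""
    return (1.0 if correct else 0.0) - alpha * length

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward against its own group.
    Using the population standard deviation is an assumption of this sketch."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A made-up group of sampled responses to one hard problem: (correct, token count).
hard_group = [(True, 1500), (True, 1800), (False, 200), (False, 300)]
rewards = [static_length_reward(c, n) for c, n in hard_group]
advantages = group_relative_advantages(rewards)
```

With these made-up numbers, the short incorrect responses receive higher rewards than the long correct ones, so the policy would be pushed toward short wrong answers on the hard problem. Lowering `alpha` avoids this but weakens compression on easy problems, which is the trade-off a single fixed coefficient cannot resolve.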