Zing Forum

Detection-Extraction Gap: Large Models Already Know the Answer But Can't Output It

This article describes the "Detection-Extraction Gap" in large reasoning models. These models settle on the answer early in the chain of thought, but forced decoding often fails to extract it. The proposed BAEE method truncates 70-78% of generation and improves accuracy by 1-5 percentage points (pp).

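Large Language Models · Inference Optimization · Early Exit · Chain of Thought · Detection-Extraction Gap · BAEE · Reasoning Efficiency · Decoding Strategies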
Published 2026-04-08 10:47 | Recent activity 2026-04-09 10:12 | Estimated read 6 min

Section 01

[Main Floor] Detection-Extraction Gap: Large Models Already Know the Answer But Struggle to Output It; BAEE Method Enables Efficient Reasoning

This article shows that large reasoning models exhibit a "Detection-Extraction Gap". They settle on the answer early in the chain of thought, but standard (forced) extraction often fails to retrieve it. The proposed BAEE method truncates 70-78% of generation while improving accuracy by 1-5 pp. The discussion is split into floors covering background, evidence, method, results, implications, and limitations.

Section 02

[Background] What is the Detection-Extraction Gap?

While generating a chain of thought, large models often keep producing redundant content after they have already worked out the answer. The research team calls this the "Detection-Extraction Gap", which has two sides:

  • Detection: internal states or free continuation show that the model already "knows" the answer early in the chain of thought;
  • Extraction: standard prompt-conditioned decoding (forced extraction) often fails to retrieve that answer.

In short, the model has internally settled on the answer, but standard methods cannot reliably get it out.
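To make the two query modes concrete, here is a minimal Python sketch of how each one reads an answer from a partial chain of thought. The `Generate` callable, `parse_answer`, the `\boxed{}` answer format, and `toy_generate` are hypothetical assumptions for illustration. The toy model simply hard-codes a forced-extraction failure so that the gap is visible.

```python
import re
from typing import Callable, Optional

# Hypothetical black-box interface: prompt text in, completion text out.
Generate = Callable[[str], str]

# Extraction question of the kind quoted in the evidence; exact wording is illustrative.
EXTRACTION_SUFFIX = "\nBased on the above reasoning, what is the answer?\nAnswer:"


def parse_answer(text: str) -> Optional[str]:
    """Assumed answer format: last \\boxed{...}, else the last number in the text."""
    boxed = re.findall(r"\\boxed\{([^}]*)\}", text)
    if boxed:
        return boxed[-1].strip()
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None


def forced_extraction(generate: Generate, question: str, cot_prefix: str) -> Optional[str]:
    # Condition the model on an explicit "answer now" prompt right after the prefix.
    return parse_answer(generate(question + cot_prefix + EXTRACTION_SUFFIX))


def free_continuation(generate: Generate, question: str, cot_prefix: str) -> Optional[str]:
    # Let the model keep reasoning from the prefix and read the answer it ends on.
    return parse_answer(generate(question + cot_prefix))


def toy_generate(prompt: str) -> str:
    # Toy stand-in for a model: answers hastily when forced, correctly when free.
    if prompt.endswith("Answer:"):
        return " 10"
    return " ...so 3 * 4 = 12, giving \\boxed{12}."
```

With a real model client in place of `toy_generate`, the same two functions can be run on the same prefix to see whether they disagree.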

Section 03

[Evidence] Experimental Data Verifies the Existence of the Gap

Two experimental findings and one theoretical argument support the gap:

  1. Across 5 model configurations from 2 model families on 3 benchmarks, 52%-88% of chain-of-thought tokens are redundant, i.e., generated after the answer has already been determined;
  2. When the chain of thought is truncated to its first 10%, free continuation from that prefix can still recover the correct answer, whereas forced extraction (e.g., asking "Based on the above reasoning, what is the answer?") fails up to 42% of the time;
  3. Theoretically, a total-variation bound shows that the conditioning imposed by forced extraction shifts the output distribution, interrupting the natural reasoning trajectory and causing failures.
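The redundancy figure in finding 1 depends on locating the point where the answer is "determined". The sketch below assumes one plausible definition that the post does not spell out: the answer counts as determined at the earliest probed prefix after which every free-continuation probe returns the final answer. `commit_point`, `redundant_fraction`, and the probe format are hypothetical.

```python
from typing import Optional, Sequence, Tuple

# Each probe: (number of CoT tokens kept, answer recovered from that prefix or None).
Probe = Tuple[int, Optional[str]]


def commit_point(probes: Sequence[Probe], final_answer: str) -> Optional[int]:
    """Earliest probed prefix length from which every later probe returns final_answer."""
    point: Optional[int] = None
    for prefix_len, answer in sorted(probes, key=lambda p: p[0]):
        if answer == final_answer:
            if point is None:
                point = prefix_len  # start of a run that agrees with the final answer
        else:
            point = None  # the answer changed again, so the earlier run does not count
    return point


def redundant_fraction(probes: Sequence[Probe], final_answer: str, total_tokens: int) -> float:
    """Share of the chain of thought generated after the answer was already fixed."""
    point = commit_point(probes, final_answer)
    if point is None:
        return 0.0
    return (total_tokens - point) / total_tokens
```

For example, probes `[(100, "7"), (200, "12"), (300, "12"), (400, "12")]` on a 1,000-token trace put the commit point at 200, so 80% of the trace would count as redundant. Averaging this over a benchmark gives a number comparable in spirit to the 52%-88% range, though the paper's exact procedure may differ.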

Section 04

[Method] BAEE: Black-box Adaptive Early Exit Strategy

BAEE (Black-box Adaptive Early Exit) exploits the gap to make reasoning more efficient while needing only black-box access to the model. Its core steps are:

  1. Detect Answer Readiness: During generation, periodically use lightweight free continuation tests to determine whether the model is ready to output the answer;
  2. Extract and Terminate: Once readiness is detected, extract the answer via free continuation and stop generation immediately to avoid redundant content.
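A minimal Python sketch of this loop follows. The post does not specify the readiness test, so this version assumes one: after every `check_every` chunks it samples `k` free continuations and declares the model ready once the most common answer reaches a `threshold` share. `step`, `probe`, `ExitResult`, and the toy model are hypothetical stand-ins, not BAEE's actual interface.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical black-box hooks (not a real API):
#   step(trace)  -> next chunk of reasoning text, or "" once the model stops by itself
#   probe(trace) -> answer parsed from one free continuation of trace, or None
StepFn = Callable[[str], str]
ProbeFn = Callable[[str], Optional[str]]


@dataclass
class ExitResult:
    answer: Optional[str]
    chunks_used: int
    probe_calls: int
    exited_early: bool


def adaptive_early_exit(step: StepFn, probe: ProbeFn, max_chunks: int = 64,
                        check_every: int = 1, k: int = 3,
                        threshold: float = 1.0) -> ExitResult:
    trace, chunks, calls = "", 0, 0
    while chunks < max_chunks:
        chunk = step(trace)
        if not chunk:
            break  # the model finished on its own
        trace += chunk
        chunks += 1
        if chunks % check_every:
            continue  # only test readiness every `check_every` chunks
        # Assumed readiness test: k free continuations must agree often enough.
        answers: List[Optional[str]] = [probe(trace) for _ in range(k)]
        calls += k
        valid = [a for a in answers if a is not None]
        if valid:
            top, count = Counter(valid).most_common(1)[0]
            if count / k >= threshold:
                # Answer taken from free continuation; stop generating immediately.
                return ExitResult(top, chunks, calls, True)
    # No early exit: read the answer from one free continuation of the full trace.
    calls += 1
    return ExitResult(probe(trace), chunks, calls, False)


# Toy model: five one-word chunks; the answer is settled after the third.
CHUNKS = ["setup ", "expand ", "x=12 ", "recheck ", "recheck "]


def toy_step(trace: str) -> str:
    i = len(trace.split())
    return CHUNKS[i] if i < len(CHUNKS) else ""


def toy_probe(trace: str) -> Optional[str]:
    return "12" if "x=12" in trace else None
```

On the toy model, `adaptive_early_exit(toy_step, toy_probe)` stops after 3 of 5 chunks with answer `"12"` and 9 probe calls. With a real model, `probe` would sample a short, capped continuation so each check stays cheap. `k` and `threshold` trade truncation against the risk of exiting on an unstable answer.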

Section 05

[Results] BAEE Brings Significant Efficiency and Performance Improvements

Reported results:

  • Generation Truncation Rate: 70%-78%, greatly reducing redundant tokens;
  • Accuracy improvement: 1-5 pp on every tested model, reaching up to 5.8 pp on models with an explicit thinking mode (e.g., DeepSeek-R1);
  • Cost optimization: cheaper variants need a median of only 9 API calls and still truncate 52%-73% of generation, balancing cost against savings.
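The metrics above are simple ratios. The following Python sketch spells them out using hypothetical token counts and accuracies, not figures from the paper. It also adds a `probe_tokens` term because the post does not say whether the reported truncation rates count tokens spent on readiness probes.

```python
def truncation_rate(full_tokens: int, exit_tokens: int) -> float:
    """Fraction of the full-length generation that early exit skipped."""
    return 1.0 - exit_tokens / full_tokens


def net_savings(full_tokens: int, exit_tokens: int, probe_tokens: int = 0) -> float:
    """Savings once tokens spent on readiness probes are charged back."""
    return 1.0 - (exit_tokens + probe_tokens) / full_tokens


def gain_pp(base_acc: float, new_acc: float) -> float:
    """Accuracy change in percentage points (pp), not relative percent."""
    return 100.0 * (new_acc - base_acc)
```

Exiting at 2,500 of 10,000 tokens gives a 75% truncation rate. Charging 1,000 probe tokens drops net savings to 65%. Moving from 80% to 83% accuracy is +3 pp, which corresponds to a 3.75% relative improvement.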

Section 06

[Implications and Applications] Value for Model Design and Practical Scenarios

Implications for model design:

  • Reconsider the role of chain of thought: a longer chain does not necessarily mean deeper reasoning; redundant tokens may signal that the model fails to stop once it has the answer;
  • Optimize decoding strategies: smarter strategies are needed to recognize when the answer is ready;
  • Adjust training objectives: adding early-exit objectives could teach models to organize their reasoning more efficiently.

Practical applications:

  • Lower API costs (token consumption cut by over 70%);
  • Lower response latency and a smoother real-time interaction experience;
  • Shorter reasoning displays and a better user experience.

Section 07

[Limitations and Outlook] Future Research Directions

Limitations:

  • Detection frequency and timing need further optimization;
  • Some tasks (e.g., multi-step mathematical proofs) require more cautious early-exit strategies.

Future directions:

  • Study optimal detection points to balance overhead and exit opportunities;
  • Explore applicability to different task types;
  • Combine model internal states (white-box methods) to improve detection accuracy.
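The trade-off between detection frequency and exit timing can be made concrete with a small Python calculation. It relies on a simplifying assumption that does not come from the post: probes fire at fixed token intervals, and a probe detects readiness exactly when it lands at or after the answer's commit point. `fixed_interval_tradeoff` is a hypothetical helper.

```python
import math


def fixed_interval_tradeoff(commit_point: int, total_tokens: int, interval: int) -> dict:
    """Probe every `interval` tokens; assume a probe fires iff it lands at or after commit_point."""
    probes = math.ceil(commit_point / interval)
    exit_at = probes * interval
    if exit_at >= total_tokens:
        # Readiness would only be seen after the model had finished anyway.
        return {"exit_at": total_tokens,
                "probes": (total_tokens - 1) // interval,
                "delay": total_tokens - commit_point}
    return {"exit_at": exit_at, "probes": probes, "delay": exit_at - commit_point}
```

With a commit point at token 2,100 of 10,000, probing every 250 tokens costs 9 probes and exits 150 tokens late. Probing every 1,000 tokens costs only 3 probes but exits 900 tokens late. Finding better schedules than a fixed interval is one reading of the "optimal detection points" question.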