Zing Forum

Study on Performance Degradation of Top-tier Reasoning Models When Context Windows Are Filled

A controlled experiment on top-tier reasoning models from four major vendors (Anthropic, OpenAI, Google, DeepSeek) finds that performance degrades when the context window is filled with topically adjacent but irrelevant information, even at maximum thinking settings.

LLM, reasoning models, context window, performance degradation, Anthropic, OpenAI, Google, DeepSeek, model evaluation, RAG
Published 2026-04-27 04:40 | Recent activity 2026-04-27 04:48 | Estimated read 6 min

Section 01

Introduction: Core Overview of the Study

This study runs controlled experiments on top-tier reasoning models from four major vendors: Anthropic, OpenAI, Google, and DeepSeek. It finds that when the context window is filled with topically adjacent but irrelevant information, performance degrades even at maximum thinking settings. The study uses financial analysis as its task domain, characterizes each model's drift through a five-arm controlled experiment, and discusses the implications for AI applications such as RAG systems.

Section 02

Research Background and Motivation

As large language models are increasingly applied to complex reasoning tasks, researchers want to know how reasoning ability changes when the context window is filled with large amounts of information. The Victor-EU team ran the "Reasoning Drift Study" to investigate. They chose financial analysis as the task domain because it combines skills such as fact retrieval and numerical calculation, and they used Microsoft's FY2025 financial report plus data on tech peers (Apple, Google, etc.) as the noise corpus to simulate information overload.
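
The padding step described above could be sketched roughly as follows. This is a minimal illustration, not the team's actual harness: the function names, the whitespace-based token estimate, and the rule of stopping at the first document that would overshoot are all assumptions.

```python
from typing import Callable, List


def approx_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (an assumption, not a real tokenizer)."""
    return len(text.split())


def pad_with_noise(
    task_prompt: str,
    noise_docs: List[str],
    window_tokens: int,
    fill_ratio: float,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> str:
    """Prepend noise documents until the prompt reaches roughly fill_ratio of the window."""
    if not 0.0 <= fill_ratio <= 1.0:
        raise ValueError("fill_ratio must be between 0 and 1")
    # Tokens left for noise once the task prompt itself is accounted for.
    budget = int(window_tokens * fill_ratio) - count_tokens(task_prompt)
    chosen: List[str] = []
    for doc in noise_docs:
        cost = count_tokens(doc)
        if cost > budget:
            break  # stop at the first document that would overshoot the target
        chosen.append(doc)
        budget -= cost
    # Noise goes first, the actual task goes last.
    return "\n\n".join(chosen + [task_prompt])
```

A fill ratio of 0.0 returns the task prompt alone, which corresponds to a zero-noise condition. A real run would swap in the vendor's own tokenizer for `approx_tokens`.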

Section 03

Experimental Design and Methodology

The study uses a five-arm controlled experiment covering five models: Anthropic Opus4.7 and Sonnet4.6 (effort=max), OpenAI GPT5.5 (reasoning.effort=xhigh), Google Gemini3.1Pro (thinking_level=HIGH), and DeepSeek V4Pro (reasoning_effort=max). Variables such as methodology and prompts are held constant across arms, and reproducibility is enforced through an integrity gating mechanism based on SHA-256 hashes.
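
The post does not describe how the integrity gate works. One plausible shape, sketched here under that caveat, is to pin a SHA-256 digest for every input artifact and refuse to run if any digest differs. The `SETTINGS` dictionary only restates the labels quoted above and is not a real API payload. `integrity_gate` and the artifact names are hypothetical.

```python
import hashlib
import json
from typing import Dict

# Effort labels as quoted in the post; keys and values are illustrative, not API calls.
SETTINGS = {
    "Opus4.7": {"effort": "max"},
    "Sonnet4.6": {"effort": "max"},
    "GPT5.5": {"reasoning.effort": "xhigh"},
    "Gemini3.1Pro": {"thinking_level": "HIGH"},
    "DeepSeek V4Pro": {"reasoning_effort": "max"},
}


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def integrity_gate(artifacts: Dict[str, bytes], pinned: Dict[str, str]) -> None:
    """Raise if any pinned artifact is missing or its digest has changed."""
    for name, digest in pinned.items():
        if name not in artifacts:
            raise RuntimeError(f"missing artifact: {name}")
        actual = sha256_hex(artifacts[name])
        if actual != digest:
            raise RuntimeError(f"digest mismatch for {name}: {actual}")


# Canonical JSON (sorted keys) so the same settings always hash to the same digest.
settings_bytes = json.dumps(SETTINGS, sort_keys=True).encode("utf-8")
pinned = {"settings.json": sha256_hex(settings_bytes)}
integrity_gate({"settings.json": settings_bytes}, pinned)  # passes silently
```

Serializing with sorted keys matters here: without a canonical form, two semantically identical configurations could produce different digests and trip the gate.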

Section 04

Key Findings: Drift Characteristics of the Five Models

Each model exhibits its own drift characteristics:

1. Opus4.7: monotonic decline plus hallucination issues.
2. Sonnet4.6: a quality recovery phenomenon.
3. GPT5.5: a flat-cliff pattern (sharp decline after 92% filling, hallucination rate near 0).
4. Gemini3.1Pro: the flattest drift, plus a speed advantage.
5. DeepSeek V4Pro: an absolute-pair paradox (flat in absolute evaluation, steep in pairwise comparison, with high inter-rater consistency).
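
To make terms like "monotonic decline" and "flat-cliff" concrete, the sketch below shows one way such shapes could be read off a score-versus-fill curve. It is an illustration only: the helper names are hypothetical, and the sample curve is made-up data, not a result from the study.

```python
from typing import List, Tuple

Curve = List[Tuple[float, float]]  # (fill fraction, quality score)


def is_monotonic_decline(curve: Curve) -> bool:
    """True if the score never rises as the fill fraction increases."""
    points = sorted(curve)
    return all(later[1] <= earlier[1] for earlier, later in zip(points, points[1:]))


def largest_drop(curve: Curve) -> Tuple[float, float]:
    """Return (fill level, drop size) for the biggest score decrease between adjacent fill levels."""
    points = sorted(curve)
    best_fill, best_drop = points[0][0], 0.0
    for (_, prev_score), (fill, score) in zip(points, points[1:]):
        drop = prev_score - score
        if drop > best_drop:
            best_fill, best_drop = fill, drop
    return best_fill, best_drop


# Synthetic curve shaped like a "flat-cliff": nearly flat, then a sharp fall.
synthetic = [(0.0, 80.0), (0.5, 80.0), (0.9, 79.0), (0.95, 50.0)]
cliff_at, cliff_size = largest_drop(synthetic)
```

On this synthetic curve the largest drop sits at the last fill level, which is the signature of a cliff. A curve with a "quality recovery" would fail `is_monotonic_decline` because the score rises again at some point.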

Section 05

Third Experiment: Model Ranking Comparison Under Zero Noise

Under zero-noise conditions, the Anthropic evaluators ranked the models Sonnet4.6 > Opus4.7 > GPT5.5 > DeepSeek V4Pro > Gemini3.1Pro (Spearman ρ=0.943). This differs from the absolute evaluation baseline, which places Opus first, and suggests that the choice of evaluation method affects the ranking.
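
For readers unfamiliar with the statistic, Spearman's ρ for two tie-free rankings of n items is 1 − 6Σd² / (n(n² − 1)), where d is the difference in rank position for each item. The sketch below computes it directly. The post does not say which two rankings its reported ρ compares, so the second ranking here is hypothetical and the result is not expected to match 0.943.

```python
from typing import List


def spearman_rho(ranking_a: List[str], ranking_b: List[str]) -> float:
    """Spearman rank correlation for two tie-free rankings of the same items."""
    if len(set(ranking_a)) != len(ranking_a) or sorted(ranking_a) != sorted(ranking_b):
        raise ValueError("rankings must contain the same unique items")
    n = len(ranking_a)
    if n < 2:
        raise ValueError("need at least two items")
    position_b = {item: index for index, item in enumerate(ranking_b)}
    # Sum of squared rank differences across all items.
    d_squared = sum((index - position_b[item]) ** 2 for index, item in enumerate(ranking_a))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Ranking quoted in the post (pairwise, zero noise).
pairwise = ["Sonnet4.6", "Opus4.7", "GPT5.5", "DeepSeek V4Pro", "Gemini3.1Pro"]
# HYPOTHETICAL ranking with Opus first, used only to exercise the formula.
hypothetical_absolute = ["Opus4.7", "Sonnet4.6", "GPT5.5", "DeepSeek V4Pro", "Gemini3.1Pro"]

rho = spearman_rho(pairwise, hypothetical_absolute)  # about 0.9 for this pair
```

A single swap of two adjacent models keeps ρ high, which shows how two rankings can correlate strongly and still disagree about which model comes first.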

Section 06

Research Implications and Application Recommendations

1. Top-tier models experience performance degradation when the context is filled, so RAG systems need to manage context length carefully.
2. Different models have different drift characteristics, so application design needs targeted adjustments for the model in use.
3. Multi-dimensional evaluation (pairwise plus absolute) is important.
4. The controlled experiment method provides a template for model evaluation.
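
As one illustration of the first recommendation, a RAG pipeline could cap how much retrieved text it passes to the model instead of filling the window. The sketch below is an assumption-laden example rather than something the study prescribes: `select_chunks`, the relevance threshold, and the whitespace token estimate are all hypothetical.

```python
from typing import List, Tuple


def select_chunks(
    scored_chunks: List[Tuple[float, str]],
    max_tokens: int,
    min_score: float = 0.0,
) -> List[str]:
    """Keep the highest-scoring chunks that fit in max_tokens, dropping low-relevance ones."""
    kept: List[str] = []
    used = 0
    for score, text in sorted(scored_chunks, key=lambda chunk: chunk[0], reverse=True):
        if score < min_score:
            break  # everything after this point is even less relevant
        cost = len(text.split())  # crude token estimate, an assumption
        if used + cost > max_tokens:
            continue  # skip chunks that would exceed the budget
        kept.append(text)
        used += cost
    return kept


chunks = [
    (0.9, "revenue grew this year"),
    (0.2, "unrelated peer data here"),
    (0.7, "operating margin details"),
]
context = select_chunks(chunks, max_tokens=7, min_score=0.5)
```

The relevance threshold targets exactly the failure mode the study describes: chunks that are topically adjacent but irrelevant are dropped rather than used as padding.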

Section 07

Limitations and Future Directions

Limitations: Haiku4.5 was excluded because its 200K window and unvalidated effort=max setting would introduce confounding variables. Future directions: explore drift characteristics in non-financial fields, the impact of longer context windows, and strategies to mitigate drift.

Section 08

Research Conclusions

This study provides empirical data on how top-tier reasoning models behave in complex information environments. The distinct drift characteristics of each model may reflect the vendors' design philosophies, and they offer a reference for model selection and system design. As context windows expand, studies of this kind will become more important.