Zing Forum

Unveiling the Knowledge Cutoff Date of Large Language Models: A Practical Analysis of the CutoffDateTesting Project

By analyzing celebrity death records, this study systematically tests the internal knowledge cutoff dates of mainstream large language models such as Claude, GPT-5, and Gemini, revealing the discrepancies between manufacturers' claims and actual performance.

Tags: Large Language Models · Knowledge Cutoff Date · Model Evaluation · Gemini · GPT-5 · Claude · Knowledge Timeliness · Benchmark Testing
Published 2026-05-05 00:06 · Recent activity 2026-05-05 00:19 · Estimated read 7 min

Section 01

Introduction

This article uses the CutoffDateTesting project to systematically test the knowledge cutoff dates of mainstream large language models like Claude, GPT-5, and Gemini using celebrity death records, revealing discrepancies between manufacturers' claims and actual performance. Key findings include: Gemini models have clear cutoff dates, while Claude and GPT-5 have long decay tails; the actual knowledge timeliness of some models is far lower than the cutoff dates marked by manufacturers; model size has a direct impact on knowledge retrieval ability. This research provides important insights for users in choosing large models and deploying applications.


Section 02

Background: The Dilemma of Knowledge Timeliness in Large Language Models

Modern chain-of-thought large language models can solve undergraduate and graduate-level problems, but perform poorly in fields requiring the latest knowledge, such as current news and AI developments. Although they can be supplemented with search tools or context, the models' internal reasoning ability for recent developments remains weak (e.g., struggling to reason about their own capabilities or industry status). This gap in knowledge timeliness affects practicality and limits the application value of models in rapidly evolving fields.


Section 03

Testing Method: Calibrating Knowledge Boundaries Using Celebrity Death Records and a Two-Stage Process

The CutoffDateTesting project uses celebrity death records (absolute, clearly timestamped, publicly concerned, and verifiable) to calibrate the knowledge boundaries of models. To address differences in baseline knowledge among models, a two-stage test was designed:

  1. Knowledge Check: Ask the birth year to determine if the model "knows" the celebrity;
  2. Status Check: For celebrities who pass, ask whether the person is alive, and compare the answer with real death records to determine the cutoff date.

The dataset is drawn from Wikipedia's "Notable Deaths in [Month]" pages, processed by automated scripts plus manual repairs, and finally contains 43,082 data points (January 2020 - December 2025).
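The two-stage procedure can be sketched in a few lines of Python. This is a minimal illustration of the logic described above, not the project's actual code: `query_model` is an assumed callable `(model, prompt) -> str` that you would replace with a real API client.

```python
def evaluate_record(query_model, model, name, birth_year):
    """Two-stage check for one death record, per the procedure above.

    query_model: assumed callable (model, prompt) -> str; not a real API.
    Returns "unknown" (fails stage 1), "correct" (knows the death),
    or "stale" (knows the person but believes they are still alive).
    """
    # Stage 1: knowledge check -- does the model know the person at all?
    answer = query_model(model, f"In what year was {name} born? Answer with a year only.")
    if str(birth_year) not in answer:
        return "unknown"  # record excluded from the cutoff estimate

    # Stage 2: status check -- does the model know the person has died?
    answer = query_model(model, f"Is {name} alive or dead? Answer with one word.")
    return "correct" if "dead" in answer.lower() else "stale"
```

Aggregating the "correct" rate per death month then yields the accuracy curve from which the effective cutoff is read off.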

Section 04

Key Findings: The Huge Gap Between Manufacturers' Claims and Actual Performance

After testing Claude Haiku/Sonnet/Opus 4.5, Gemini 3 Flash, Gemini 2.5 Flash Lite, and GPT-5.2, the following findings were made:

  • Clarity of cutoff date: Gemini has only a 1-2 month fuzzy zone, while Claude/GPT-5 have long decay tails ranging from 6 months to 2 years;
  • Discrepancy between claims and reality: Claude/GPT-5 are officially marked with a cutoff date of August 2025, but their accuracy on events near that date is roughly one-fifth of Gemini's;
  • Impact of model size: Gemini 2.5 Flash Lite performs worse than Gemini 3 Flash, indicating that size has a direct impact on knowledge retrieval ability.
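The distinction between a sharp cutoff and a long decay tail can be made concrete by reading an "effective cutoff" off the monthly accuracy curve. The sketch below uses illustrative accuracy numbers, not measured values from the project:

```python
def effective_cutoff(monthly_accuracy, threshold=0.8):
    """Return the last month whose accuracy meets the threshold.

    A sharp-cutoff model (like Gemini, per the findings) drops from high
    accuracy to near zero within 1-2 months, so the answer is stable across
    thresholds; a long decay tail keeps accuracy between the extremes for
    many months, pushing the effective cutoff far before the claimed one.
    """
    cutoff = None
    for month, acc in sorted(monthly_accuracy.items()):
        if acc >= threshold:
            cutoff = month
    return cutoff

# Illustrative curves (placeholder numbers):
sharp = {"2025-06": 0.95, "2025-07": 0.92, "2025-08": 0.10}
tailed = {"2025-02": 0.85, "2025-04": 0.60, "2025-06": 0.35, "2025-08": 0.15}
```

For the `sharp` curve the effective cutoff is "2025-07" regardless of reasonable threshold choice; for the `tailed` curve it is "2025-02", months before the nominal August 2025 date, which is exactly the claim-versus-reality gap described above.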

Section 05

Technical Limitations and Future Research Directions

The current task tests only internal knowledge retrieval, and it is speculated that additional reasoning tokens yield minimal improvement on this kind of task. Future research directions:

  • Vendor and size trends: Test historical data and updated models to study patterns between Google vs OpenAI/Anthropic, small models vs large models;
  • Open-source models: Focus on the Gemma model to explore whether Google's outstanding performance is due to integrating search result data;
  • Continuous learning: Test whether the Grok model truly has better continuous learning capabilities;
  • Scaling laws: Test the differences in raw recall tasks between small and large models;
  • Reasoning token effect: Test the impact of reasoning tokens on performance.

Section 06

Practical Insights: Recommendations for Large Model Users

The research offers the following insights for users:

  1. Treat manufacturers' claimed cutoff dates with caution; actual timeliness may be significantly lower than expected;
  2. Timeliness tasks need to be supplemented with external knowledge (e.g., RAG, search tools, contextual information);
  3. Consider knowledge breadth when choosing models (Gemini has an advantage in knowledge coverage);
  4. When deploying applications, evaluate the task's demand for knowledge timeliness and design compensation mechanisms.
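Recommendations 2 and 4 can be combined into a simple routing gate: questions about events past a model's effective (not claimed) cutoff are sent to external retrieval instead of internal recall. The cutoff dates and model names below are hypothetical placeholders, not the tested values:

```python
from datetime import date

# Hypothetical effective cutoffs, measured per the methodology above --
# NOT the vendors' claimed dates, which the findings show can be optimistic.
EFFECTIVE_CUTOFFS = {
    "model-a": date(2025, 7, 1),
    "model-b": date(2024, 9, 1),
}

def needs_external_knowledge(model, event_date):
    """True if the query concerns events after the model's effective cutoff,
    so a RAG/search step should supply the facts instead of internal recall.
    Unknown models are routed to retrieval by default (fail safe)."""
    cutoff = EFFECTIVE_CUTOFFS.get(model)
    return cutoff is None or event_date > cutoff
```

The key design choice is keying the gate on the measured effective cutoff rather than the vendor's claimed date, which directly applies the report's finding that the two can differ by months or years.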

Section 07

Conclusion: Understanding Knowledge Timeliness Limitations is Key to Application

The CutoffDateTesting project, through rigorous empirical research, reveals the real performance of mainstream large models in terms of knowledge timeliness. Although the reasoning ability of large models has improved dramatically, the knowledge update mechanism still has fundamental limitations. For application scenarios that rely on the latest information, understanding these limitations and designing compensation strategies are key to the successful deployment of large model applications.