Zing Forum

Large Language Model Practical Test: Performance Comparison in Movie Retrieval, Long Text Understanding, and Image Transcription

Based on yixy's LLM benchmark project, this article deeply analyzes the performance differences of mainstream large models such as DeepSeek, Gemini, and Doubao in tasks like movie information retrieval, long text semantic understanding, and image structure transcription, providing empirical references for model selection.

Tags: large language models, benchmarking, DeepSeek, Gemini, Doubao, model evaluation, multimodal, long text understanding, ChatGPT, AI comparison
Published 2026-06-05 08:39 | Recent activity 2026-06-05 08:52 | Estimated read 6 min

Section 01

[Introduction] Empirical Analysis of Multi-Task Performance Comparison of Mainstream Large Language Models

This article draws on llm-benchmark, an open-source project maintained by yixy (source: GitHub, published in June 2026), to compare mainstream large language models such as DeepSeek, Gemini, and Doubao side by side on three tasks: movie information retrieval, long text semantic understanding, and image structure transcription. The results are intended as an empirical reference for model selection. The tests were conducted in May 2026.

Section 02

Background: Why Do We Need Large Language Model Benchmark Tests?

With the emergence of models like ChatGPT and DeepSeek, developers face a model selection problem: the strengths a vendor advertises do not always match actual performance, and that performance varies by task. llm-benchmark uses targeted test cases to reveal how models differ in practical applications, helping users choose a model that suits their needs.

Section 03

Test Objects and Methods

Test Objects:

Model            Provider   Test Version
DeepSeek         DeepSeek   Expert Mode
Gemini           Google     3.1 Pro
Doubao           ByteDance  Expert + Super Power Mode
ChatGPT          OpenAI     -
Tencent Yuanbao  Tencent    -

The tests cover three dimensions: movie information retrieval, long text semantic understanding, and image structure transcription.
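One way to keep results comparable across the three dimensions is to record every run as a small structured entry and aggregate pass rates per model and dimension. The sketch below is only an illustration under stated assumptions: `RunResult`, `pass_rates`, the dimension identifiers, and the placeholder model names are hypothetical and are not taken from llm-benchmark.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical identifiers for the three dimensions named in the article.
DIMENSIONS = ("movie_retrieval", "long_text", "image_transcription")


@dataclass(frozen=True)
class RunResult:
    """One run of one test case against one model (hypothetical record)."""
    model: str
    dimension: str
    passed: bool


def pass_rates(results: list[RunResult]) -> dict[tuple[str, str], float]:
    """Return the pass rate for each (model, dimension) pair."""
    totals = defaultdict(lambda: [0, 0])  # (model, dimension) -> [passed, total]
    for result in results:
        if result.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {result.dimension}")
        bucket = totals[(result.model, result.dimension)]
        bucket[0] += int(result.passed)
        bucket[1] += 1
    return {key: passed / total for key, (passed, total) in totals.items()}


if __name__ == "__main__":
    # Placeholder data, not results from the article.
    demo = [
        RunResult("model-a", "long_text", True),
        RunResult("model-a", "long_text", False),
        RunResult("model-b", "movie_retrieval", True),
    ]
    for key, rate in pass_rates(demo).items():
        print(key, rate)
```

Keeping the dimension on each record makes it harder to average unrelated tasks into a single score, which fits the article's point that performance varies by task.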

Section 04

Evidence 1: Performance in Movie Information Retrieval Task

Test Design: Each model receives a vague plot description (an American sci-fi film from 2000-2012, AI intervening in life, fake videos replacing the government, etc.) and must identify the movie and output the result as JSON.

Results:

  • DeepSeek: Stably identified Eyeborgs (confidence 100%), alternative Eagle Eye (60%), standardized format;
  • Doubao: Occasionally correct but poor stability, long response time.

Key Findings: DeepSeek is more stable and reliable in knowledge reasoning tasks.
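Because this task asks for JSON and the findings hinge on stability, both properties can be checked mechanically over repeated runs. The following is a minimal sketch, assuming a reply schema with `title` and `confidence` fields; the source only says the models output JSON, so the field names and the helpers `parse_movie_answer` and `stability` are hypothetical.

```python
import json

# Assumed schema: the article does not specify the JSON fields,
# so "title" and "confidence" are illustrative names only.
REQUIRED_FIELDS = {"title": str, "confidence": (int, float)}


def parse_movie_answer(raw: str) -> dict:
    """Parse a model reply and check it against the assumed schema."""
    answer = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    if not isinstance(answer, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(answer.get(field), expected_type):
            raise ValueError(f"missing or mistyped field: {field}")
    if not 0 <= answer["confidence"] <= 100:
        raise ValueError("confidence must be a percentage between 0 and 100")
    return answer


def stability(replies: list[str], expected_title: str) -> float:
    """Fraction of repeated runs that are well-formed and name the expected movie."""
    hits = 0
    for raw in replies:
        try:
            answer = parse_movie_answer(raw)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            continue
        if answer["title"].strip().lower() == expected_title.lower():
            hits += 1
    return hits / len(replies) if replies else 0.0
```

Scoring malformed replies as misses rather than skipping them means format failures and wrong answers both lower the stability figure.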

Section 05

Evidence 2: Performance in Long Text Semantic Understanding Task

Test Design: The input is text from Romance of the Three Kingdoms in which "Liu Bei" has been replaced with "Ma Bei". Each model must output a summary and the sentences containing "Da Sima said".

Results:

  • DeepSeek: Did not identify the replacement, extracted 4 references, good completeness;
  • Gemini: Did not identify the replacement, extracted only 3 references, complete summary.

Key Findings: Models have insufficient sensitivity to abnormal patterns in artificially modified text, and there is still room for improvement in long text detail extraction.
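The name substitution and the reference list of matching sentences can both be produced programmatically, which gives a ground truth to score extractions against. This is a rough sketch, not the project's actual tooling: `make_adversarial`, `sentences_containing`, and `recall` are hypothetical helpers, and the sentence splitter is deliberately naive.

```python
import re


def make_adversarial(text: str, original: str = "Liu Bei",
                     replacement: str = "Ma Bei") -> str:
    """Swap one name for another throughout the text."""
    return text.replace(original, replacement)


def sentences_containing(text: str, marker: str = "Da Sima said") -> list[str]:
    """Return sentences that contain the marker.

    Naive splitter: breaks after Western or Chinese sentence-ending
    punctuation and ignores quotations and abbreviations.
    """
    sentences = re.split(r"(?<=[.!?。！？])\s*", text)
    return [s.strip() for s in sentences if marker in s]


def recall(extracted: list[str], reference: list[str]) -> float:
    """Share of reference sentences that appear in the model's extraction."""
    if not reference:
        return 1.0
    found = {s.strip() for s in extracted}
    return sum(1 for s in reference if s in found) / len(reference)
```

Exact string matching is strict; a model that paraphrases a sentence would score as a miss, so a looser comparison may be preferable in practice.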

Section 06

Evidence 3: Performance in Image Recognition and Structure Transcription Task

Test Design: Convert tree structure diagrams into Mermaid charts and ASCII flowcharts.

Results:

  • Mermaid format: All models failed;
  • ASCII flowchart: Gemini performed best, able to clearly present hierarchical relationships and connection methods.

Key Findings: Gemini's native multimodal capability is leading, but models still have limitations in precise structured output (such as Mermaid).
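To make the two target formats concrete, the sketch below renders a small tree as a Mermaid flowchart and as an ASCII tree. It is an illustration only: the tree is a made-up example rather than the diagram used in the benchmark, and `to_mermaid` and `to_ascii` are hypothetical helpers for producing reference outputs.

```python
# Hypothetical tree (parent -> children); not the benchmark's test image.
Tree = dict[str, list[str]]


def to_mermaid(tree: Tree, root: str) -> str:
    """Render the tree as a top-down Mermaid flowchart."""
    lines = ["graph TD"]
    ids: dict[str, str] = {}

    def node_id(name: str) -> str:
        # Short generated ids keep labels with spaces out of the id position.
        if name not in ids:
            ids[name] = f"n{len(ids)}"
        return ids[name]

    def walk(name: str) -> None:
        for child in tree.get(name, []):
            lines.append(
                f'    {node_id(name)}["{name}"] --> {node_id(child)}["{child}"]'
            )
            walk(child)

    walk(root)
    return "\n".join(lines)


def to_ascii(tree: Tree, root: str) -> str:
    """Render the tree with plain ASCII branch markers."""
    lines = [root]

    def walk(name: str, prefix: str) -> None:
        children = tree.get(name, [])
        for index, child in enumerate(children):
            last = index == len(children) - 1
            lines.append(prefix + ("`-- " if last else "|-- ") + child)
            walk(child, prefix + ("    " if last else "|   "))

    walk(root, "")
    return "\n".join(lines)


if __name__ == "__main__":
    demo = {"Root": ["A", "B"], "A": ["A1"]}
    print(to_mermaid(demo, "Root"))
    print(to_ascii(demo, "Root"))
```

The Mermaid output must satisfy a strict grammar before it renders at all, while the ASCII tree only has to be readable, which may be part of why the stricter format proved harder for the models.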

Section 07

Conclusions and Model Selection Recommendations

  • DeepSeek: Advantages: Stable knowledge retrieval/reasoning, complete long text detail extraction, fast response; Applicable scenarios: Knowledge Q&A, literature retrieval, production environment.
  • Gemini: Advantages: Leading multimodal capability, high-quality text generation; Applicable scenarios: Image-text mixed tasks, image analysis, creative writing.
  • Doubao: Advantages: Good Chinese optimization, rich functions; Notes: Slow response in complex reasoning, stability needs improvement; Applicable scenarios: Chinese dialogue, daily Q&A.

Section 08

Methodological Takeaways and Conclusion

Methodological Takeaways:

  1. Task design needs to be targeted (focus on practical scenarios);
  2. Adversarial testing (such as replacing names) can expose weaknesses in model robustness;
  3. Multi-dimensional evaluation is needed (knowledge, reasoning, multimodal, etc.).

Conclusion: There is no all-purpose model; choose according to needs. Current models have limitations, and community-collaborated benchmark tests help clarify the boundaries of model capabilities.