# Large Language Model Practical Test: Performance Comparison in Movie Retrieval, Long Text Understanding, and Image Transcription

> Drawing on yixy's llm-benchmark project, this article compares how mainstream large models such as DeepSeek, Gemini, and Doubao perform on movie information retrieval, long-text semantic understanding, and image structure transcription, as an empirical reference for model selection.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-05T00:39:53.000Z
- Last activity: 2026-06-05T00:52:50.027Z
- Popularity: 163.8
- Keywords: large language models, benchmarking, DeepSeek, Gemini, Doubao, model evaluation, multimodal, long-text understanding, ChatGPT, AI comparison
- Page link: https://www.zingnex.cn/en/forum/thread/llm-github-yixy-llm-benchmark
- Canonical: https://www.zingnex.cn/forum/thread/llm-github-yixy-llm-benchmark

---

## [Introduction] An Empirical Multi-Task Comparison of Mainstream Large Language Models

This article draws on llm-benchmark, an open-source project maintained by yixy (source: GitHub, published June 2026), to compare mainstream large language models such as DeepSeek, Gemini, and Doubao side by side on three tasks: movie information retrieval, long-text semantic understanding, and image structure transcription. The tests were run in May 2026, and the results are meant as empirical input for model selection.

## Background: Why Do We Need Large Language Model Benchmark Tests?

With models such as ChatGPT and DeepSeek now widely available, developers face a selection problem: a model's advertised strengths do not always match its actual performance, and that performance varies by task. llm-benchmark uses targeted test cases to expose how models differ in practical use, helping users pick a model that fits their needs.

## Test Objects and Methods

**Test Objects**:

| Model | Provider | Version / Mode Tested |
|-------|----------|-----------------------|
| DeepSeek | DeepSeek | Expert Mode |
| Gemini | Google | 3.1 Pro |
| Doubao | ByteDance | Expert + Super Power Mode |
| ChatGPT | OpenAI | - |
| Tencent Yuanbao | Tencent | - |

The tests cover three dimensions: movie information retrieval, long text semantic understanding, and image structure transcription.

## Evidence 1: Performance in Movie Information Retrieval Task

**Test Design**: Each model receives a vague plot description (an American sci-fi film from 2000-2012, AI intervening in daily life, fake videos replacing the government, etc.) and must identify the film and return the result as JSON.

**Results**:
- DeepSeek: Consistently identified *Eyeborgs* (confidence 100%) and offered *Eagle Eye* (60%) as an alternative, in a well-formed output;
- Doubao: Occasionally correct, but with poor stability and long response times.

**Key Findings**: On this knowledge-reasoning task, DeepSeek was more stable and reliable than Doubao.
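
"Stability" here can be made concrete by sending the same prompt several times and counting how often the parsed answer is correct. The following is a minimal sketch, not the project's actual harness: the JSON shape (a top-level `title` field with a `confidence` value), the helper names `parse_answer` and `stability`, and the sample replies are all assumptions made for illustration.

```python
import json
from collections import Counter


def parse_answer(raw):
    """Return the top title from one reply, or None if the reply is not
    valid JSON in the assumed shape {"title": ..., "confidence": ...}."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    title = data.get("title")
    if isinstance(title, str) and title.strip():
        return title.strip()
    return None


def stability(replies, expected):
    """Summarise repeated runs of the same prompt against one expected title."""
    titles = [parse_answer(r) for r in replies]
    valid = [t for t in titles if t is not None]
    correct = sum(1 for t in valid if t.casefold() == expected.casefold())
    most_common = Counter(valid).most_common(1)
    return {
        "runs": len(replies),
        "valid_json": len(valid),
        "correct": correct,
        "accuracy": correct / len(replies) if replies else 0.0,
        "modal_answer": most_common[0][0] if most_common else None,
    }


# Invented replies, standing in for four runs of one model.
replies = [
    '{"title": "Eyeborgs", "confidence": 1.0}',
    '{"title": "Eyeborgs", "confidence": 0.9}',
    '{"title": "Eagle Eye", "confidence": 0.6}',
    "Sorry, I am not sure which film this is.",
]
report = stability(replies, "Eyeborgs")
print(report)
```

Counting malformed replies as failures, rather than dropping them, keeps format compliance and correctness in one number, which matters when the result feeds another program.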

## Evidence 2: Performance in Long Text Semantic Understanding Task

**Test Design**: The input is text from *Romance of the Three Kingdoms* in which "Liu Bei" has been replaced with "Ma Bei". Each model must produce a summary and list the sentences containing "Da Sima said".

**Results**:
- DeepSeek: Did not flag the replacement; extracted 4 references, with good completeness;
- Gemini: Did not flag the replacement; extracted only 3 references, though its summary was complete.

**Key Findings**: Neither model was sensitive to the anomaly introduced by the artificial edit, and detail extraction from long text still has room for improvement.
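
A test of this kind needs a perturbed input and a ground-truth list of matching sentences to score against. The sketch below shows one way to build both; it is an illustration under stated assumptions, not the method used by llm-benchmark. The function names, the naive sentence splitter (which does not handle punctuation inside quotations), and the short English passage are all hypothetical stand-ins for the real text.

```python
import re


def substitute_name(text, original, replacement):
    """Create the adversarial input by swapping one name for another."""
    return text.replace(original, replacement)


def sentences_containing(text, marker):
    """Ground truth: every sentence that contains the marker phrase.

    Splits after Chinese or Western sentence-ending punctuation.
    Requires Python 3.7+ (re.split on patterns that can match empty strings).
    """
    parts = re.split(r"(?<=[。!?!?.])\s*", text)
    return [p.strip() for p in parts if marker in p]


def recall(model_sentences, reference_sentences):
    """Fraction of ground-truth sentences the model reproduced exactly."""
    if not reference_sentences:
        return 1.0
    produced = {s.strip() for s in model_sentences}
    hits = sum(1 for ref in reference_sentences if ref in produced)
    return hits / len(reference_sentences)


# Invented passage; the real test uses the novel's text.
passage = (
    "Liu Bei entered the hall. Da Sima said the army should wait. "
    "Liu Bei agreed. Later Da Sima said nothing more."
)
perturbed = substitute_name(passage, "Liu Bei", "Ma Bei")
reference = sentences_containing(perturbed, "Da Sima said")

# Pretend the model returned only the first matching sentence.
model_output = ["Da Sima said the army should wait."]
score = recall(model_output, reference)
print(reference, score)
```

Computing the reference list programmatically from the perturbed text means the expected count is known before any model is queried, so "extracted 3" versus "extracted 4" can be read against a fixed total.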

## Evidence 3: Performance in Image Recognition and Structure Transcription Task

**Test Design**: Each model is given a tree-structure diagram as an image and asked to transcribe it as a Mermaid chart and as an ASCII flowchart.

**Results**:
- Mermaid format: All models failed;
- ASCII flowchart: Gemini performed best, clearly rendering the hierarchy and the connections between nodes.

**Key Findings**: Gemini's native multimodal capability leads the group, but all models remain limited in precise structured output such as Mermaid.
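
The source does not say how a Mermaid transcription was judged to have failed. One possible check, sketched here as an assumption, is to extract the parent-child edges from the model's Mermaid text and compare them with a hand-written list of the edges in the original diagram. The parser below covers only a small subset of Mermaid (`id[Label] --> id[Label]` lines); the function names and the sample tree are hypothetical.

```python
import re

# Matches node definitions such as  root[Root]
NODE = re.compile(r"(\w+)\[([^\]]*)\]")
# Matches simple edges such as  root[Root] --> a[Child A]  or  root --> b
EDGE = re.compile(
    r"^\s*(\w+)(?:\[[^\]]*\])?\s*-->\s*(\w+)(?:\[[^\]]*\])?\s*;?\s*$"
)


def mermaid_edges(source):
    """Return the set of (parent_label, child_label) edges in the source.

    Node ids are resolved to their labels where a label was declared,
    because models choose ids arbitrarily.
    """
    labels = {}
    raw_edges = []
    for line in source.splitlines():
        for node_id, label in NODE.findall(line):
            labels[node_id] = label.strip()
        match = EDGE.match(line)
        if match:
            raw_edges.append((match.group(1), match.group(2)))
    return {(labels.get(p, p), labels.get(c, c)) for p, c in raw_edges}


def compare(expected, produced):
    """List the edges the model dropped or invented."""
    return {
        "missing": sorted(expected - produced),
        "extra": sorted(produced - expected),
        "exact": expected == produced,
    }


# Invented model output for an invented four-edge tree.
produced_source = """graph TD
    root[Root] --> a[Child A]
    root --> b[Child B]
    a --> c[Leaf C]
"""
expected = {
    ("Root", "Child A"),
    ("Root", "Child B"),
    ("Child A", "Leaf C"),
    ("Child B", "Leaf D"),
}
result = compare(expected, mermaid_edges(produced_source))
print(result)
```

Comparing edge sets rather than raw text tolerates differences in node ids and line order while still catching a dropped or misplaced branch.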

## Conclusions and Model Selection Recommendations

- **DeepSeek**: Strengths: stable knowledge retrieval and reasoning, complete long-text detail extraction, fast responses. Suitable for: knowledge Q&A, literature retrieval, production environments.
- **Gemini**: Strengths: leading multimodal capability, high-quality text generation. Suitable for: mixed image-text tasks, image analysis, creative writing.
- **Doubao**: Strengths: well optimized for Chinese, rich feature set. Caveats: slow on complex reasoning, stability needs improvement. Suitable for: Chinese dialogue, everyday Q&A.

## Methodological Takeaways and Conclusion

**Methodological Takeaways**:
1. Task design should be targeted and grounded in practical scenarios;
2. Adversarial testing (such as replacing names) can expose how robust a model is;
3. Evaluation should span multiple dimensions (knowledge, reasoning, multimodal, etc.).

**Conclusion**: No single model is best at everything, so choose according to the task. Current models still have limitations, and community-driven benchmarks help map the boundaries of what they can do.
