Large Language Model Practical Test: Performance Comparison in Movie Retrieval, Long Text Understanding, and Image Transcription

Based on yixy's LLM benchmark project, this article analyzes how mainstream large models such as DeepSeek, Gemini, and Doubao differ in performance on movie information retrieval, long text semantic understanding, and image structure transcription, providing empirical references for model selection.

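Tags: large language models, benchmark testing, DeepSeek, Gemini, Doubao, model evaluation, multimodal, long text understanding, ChatGPT, AI comparison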

Published 2026-06-05 08:39 | Recent activity 2026-06-05 08:52 | Estimated read 6 min

Section 01

[Introduction] Empirical Analysis of Multi-Task Performance Comparison of Mainstream Large Language Models

Based on the open-source project llm-benchmark maintained by yixy (Source: GitHub, published in June 2026), this article compares mainstream large language models such as DeepSeek, Gemini, and Doubao side by side on three tasks: movie information retrieval, long text semantic understanding, and image structure transcription. The tests were conducted in May 2026, and the results are intended as empirical references for model selection.

Section 02

Background: Why Do We Need Large Language Model Benchmark Tests?

With the emergence of models like ChatGPT and DeepSeek, developers face a model selection problem: the strengths highlighted in a model's promotion do not always match its actual performance, which varies by task. llm-benchmark uses targeted test cases to reveal how different models differ in practical applications, helping users choose models that suit their needs.

Section 03

Test Objects and Methods

Test Objects:

Model	Provider	Test Version
DeepSeek	DeepSeek	Expert Mode
Gemini	Google	3.1 Pro
Doubao	ByteDance	Expert + Super Power Mode
ChatGPT	OpenAI	-
Tencent Yuanbao	Tencent	-

The tests cover three dimensions: movie information retrieval, long text semantic understanding, and image structure transcription.

Section 04

Evidence 1: Performance in Movie Information Retrieval Task

Test Design: Each model is given a vague plot description (an American sci-fi film released between 2000 and 2012, an AI intervening in people's lives, fake videos replacing the government, etc.) and is required to identify the movie and output the result as JSON. Results:

DeepSeek: Consistently identified Eyeborgs (confidence 100%), with Eagle Eye as the alternative (60%), in a well-formed output format;
Doubao: Occasionally correct, but with poor stability and long response times.

Key Findings: DeepSeek is more stable and reliable in knowledge reasoning tasks.
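
A check of this kind of JSON answer, and of how stable a model is across repeated runs, could be sketched as follows. This is a minimal illustration, not the project's own scoring code: the article does not show the JSON schema, so the `title` field and the helper names `score_reply` and `stability` are assumptions.

```python
import json


def score_reply(reply: str, expected_title: str) -> dict:
    """Check whether a reply contains valid JSON naming the expected movie.

    The "title" key is a hypothetical schema; adapt it to the real prompt.
    """
    # Models often wrap JSON in extra prose, so take the outermost braces.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        return {"valid_json": False, "correct": False}
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return {"valid_json": False, "correct": False}
    title = str(data.get("title", "")).strip().lower()
    return {"valid_json": True, "correct": title == expected_title.strip().lower()}


def stability(replies: list[str], expected_title: str) -> float:
    """Fraction of repeated runs that named the expected movie."""
    if not replies:
        return 0.0
    hits = sum(score_reply(r, expected_title)["correct"] for r in replies)
    return hits / len(replies)
```

Sending the same prompt several times and passing all replies to `stability` gives a single number for the "stable versus occasionally correct" distinction drawn above.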

Section 05

Evidence 2: Performance in Long Text Semantic Understanding Task

Test Design: The models are given text from Romance of the Three Kingdoms in which "Liu Bei" has been replaced with "Ma Bei", and are required to output a summary together with the sentences containing "Da Sima said". Results:

DeepSeek: Did not identify the replacement; extracted 4 references, with good completeness;
Gemini: Did not identify the replacement; extracted only 3 references, with a complete summary.

Key Findings: Models are not sensitive enough to abnormal patterns in artificially modified text, and there is still room for improvement in extracting details from long text.
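
The name-replacement setup and the extraction check can be sketched in a few lines. This is an illustration under stated assumptions, not the project's code: the helper names are hypothetical, the sample sentences in the usage note are invented, and the sentence splitter is deliberately naive (it splits after Western or Chinese end punctuation and does not handle quotation marks).

```python
import re


def make_adversarial(text: str, original: str = "Liu Bei",
                     replacement: str = "Ma Bei") -> str:
    """Build the modified test text by swapping one name for another."""
    return text.replace(original, replacement)


def sentences_containing(text: str, phrase: str) -> list[str]:
    """Ground truth: every sentence in the text that contains the phrase."""
    # Split right after sentence-ending punctuation, dropping following spaces.
    sentences = re.split(r"(?<=[.!?。！？])\s*", text)
    return [s.strip() for s in sentences if phrase in s]


def recall(expected: list[str], extracted: list[str]) -> float:
    """Share of ground-truth sentences that the model's answer included."""
    if not expected:
        return 1.0
    found = set(extracted)
    return sum(1 for s in expected if s in found) / len(expected)
```

For example, `sentences_containing(make_adversarial(source), "Da Sima said")` yields the reference list, and `recall` compares a model's extracted sentences against it, which turns "4 references versus 3" into a comparable score once the true count is known.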

Section 06

Evidence 3: Performance in Image Recognition and Structure Transcription Task

Test Design: The models are given images of tree structure diagrams and are required to transcribe them into Mermaid charts and ASCII flowcharts. Results:

Mermaid format: All models failed;
ASCII flowchart: Gemini performed best, clearly presenting the hierarchical relationships and the connections between nodes.

Key Findings: Gemini's native multimodal capability leads the field, but models still have limitations in precise structured output (such as Mermaid).
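
To judge a transcription, it helps to have reference output in both target formats. The sketch below renders a small tree as a Mermaid flowchart and as an ASCII tree; the tree itself, the `n0`, `n1`, ... node ids, and the function names are assumptions made for illustration, since the article does not reproduce the test diagrams.

```python
def to_mermaid(tree: dict[str, list[str]], root: str) -> str:
    """Render a parent -> children mapping as a Mermaid top-down flowchart."""
    ids: dict[str, str] = {}
    lines = ["graph TD"]

    def node_id(label: str) -> str:
        # Declare each node once with a safe id and a quoted label.
        if label not in ids:
            ids[label] = f"n{len(ids)}"
            lines.append(f'    {ids[label]}["{label}"]')
        return ids[label]

    def walk(label: str) -> None:
        parent = node_id(label)
        for child in tree.get(label, []):
            lines.append(f"    {parent} --> {node_id(child)}")
            walk(child)

    walk(root)
    return "\n".join(lines)


def to_ascii(tree: dict[str, list[str]], root: str) -> str:
    """Render the same mapping as an indented ASCII tree."""
    lines = [root]

    def walk(label: str, prefix: str) -> None:
        children = tree.get(label, [])
        for i, child in enumerate(children):
            last = i == len(children) - 1
            lines.append(prefix + ("`-- " if last else "|-- ") + child)
            walk(child, prefix + ("    " if last else "|   "))

    walk(root, "")
    return "\n".join(lines)
```

Calling both functions on the same mapping, for instance `{"root": ["a", "b"], "a": ["c"]}`, gives two references that a model's Mermaid and ASCII answers can be compared against by eye or by diff.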

Section 07

Conclusions and Model Selection Recommendations

DeepSeek: Advantages: Stable knowledge retrieval/reasoning, complete long text detail extraction, fast response; Applicable scenarios: Knowledge Q&A, literature retrieval, production environment.
Gemini: Advantages: Leading multimodal capability, high-quality text generation; Applicable scenarios: Image-text mixed tasks, image analysis, creative writing.
Doubao: Advantages: Good Chinese optimization, rich functions; Notes: Slow response in complex reasoning, stability needs improvement; Applicable scenarios: Chinese dialogue, daily Q&A.

Section 08

Methodological Lessons and Conclusion

Methodological Lessons:

Task design needs to be targeted (focus on practical scenarios);
Adversarial testing (such as replacing names) can expose weaknesses in model robustness;
Multi-dimensional evaluation is needed (knowledge, reasoning, multimodal, etc.).

Conclusion: There is no all-purpose model; choose according to your needs. Current models have limitations, and community-driven benchmark tests help map the boundaries of model capabilities.
