Zing Forum

Pathology LLM Benchmarks Are Underestimated: How Input Design Choices Determine Performance

A systematic analysis finds that the apparent underperformance of general-purpose LLMs on pathology tasks is largely due to suboptimal input configurations. Optimizing design choices such as tile size and magnification raised GPT-5's accuracy on cancer classification from 15.1% to 39.5%, which challenges the assumption that pathology requires specialized models.

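Medical AI, Pathology, Multimodal LLM, Benchmarking, Input Configuration, Whole-Slide Images, Model Evaluation, Configuration Optimization, Medical Imaging
Published 2026-06-11 01:59 | Recent activity 2026-06-11 11:28 | Estimated read 6 min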

Section 01

[Introduction] Pathology LLM Benchmarks Are Underestimated: Input Configuration Optimization Upends Traditional Perceptions

The core argument of this article: the apparent underperformance of general-purpose LLMs on pathology tasks stems largely from suboptimal input configuration rather than from insufficient model capability. Optimizing design choices such as tile size and magnification (for example, large tiles at low magnification with joint processing) raised GPT-5's accuracy on cancer classification from 15.1% to 39.5%, which challenges the assumption that pathology requires specialized models.

Section 02

Background: Benchmarking Dilemmas in Pathology AI

Digital pathology relies on high-resolution whole-slide images (WSIs). Existing benchmarks commonly split each slide into small tiles, favor high magnification, process every tile independently, and combine the per-tile predictions by majority voting. Under this setup, general-purpose LLMs perform far worse than specialized models, and the field has generally concluded that pathology tasks require domain-specific models. The study asks whether the gap stems from input configuration rather than from model capability.

Section 03

Key Findings: Input Design Factors and Optimal Configurations

The study analyzes four input factors: reasoning mode (independent or joint), tile size (small or large), magnification (high or low), and number of tiles (few or many). The best-performing configuration combines large tiles, low magnification, and joint processing. Large tiles preserve tissue structure and context; low magnification gives a macro-level view and is more efficient; joint processing lets the model integrate information across tiles itself.
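
The difference between the two reasoning modes can be sketched in a few lines of Python. This is an illustration only, not the study's code: `InputConfig`, `predict_slide`, and the `classify` callable (standing in for a call to a multimodal model) are hypothetical names introduced here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class InputConfig:
    """Hypothetical container for the four input factors."""
    reasoning_mode: str   # "independent" or "joint"
    tile_size_px: int     # edge length of each square tile, in pixels
    magnification: float  # objective magnification, e.g. 5.0 or 40.0
    num_tiles: int        # tiles sampled per slide


def predict_slide(
    tiles: Sequence[object],
    config: InputConfig,
    classify: Callable[[Sequence[object]], str],
) -> str:
    """Return one slide-level label under the chosen reasoning mode.

    `classify` is an assumed stand-in for the model call: it takes a
    batch of tiles and returns a single label.
    """
    selected = list(tiles[: config.num_tiles])
    if not selected:
        raise ValueError("no tiles to classify")
    if config.reasoning_mode == "joint":
        # One call sees every tile, so the model integrates them itself.
        return classify(selected)
    if config.reasoning_mode == "independent":
        # One call per tile, then a majority vote over the tile labels.
        votes = Counter(classify([tile]) for tile in selected)
        return votes.most_common(1)[0][0]
    raise ValueError(f"unknown reasoning mode: {config.reasoning_mode}")
```

In independent mode the model never sees two tiles together, so any cross-tile reasoning has to be approximated by the vote; in joint mode that integration is left to the model.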

Section 04

Evidence: Significant Effects of Configuration Optimization

GPT-5 results: on TCGA cancer classification, accuracy rose from a 15.1% baseline to 39.5% after optimization (a 162% relative improvement); on GTEx organ classification, it rose from 38.1% to 62.9% (a 65% relative improvement). Cross-model generalization: Gemini 3 Flash improved by 23.4% on CPTAC, with similar results for Claude 3.5 Sonnet. Cross-dataset generalization: the configuration was effective on CPTAC without further tuning. Task-specific optimization improved performance further, with TCGA reaching 43.9% and GTEx reaching 71.6%.
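
The improvement percentages quoted for GPT-5 are relative gains over the baseline, not percentage-point differences. A short check of the arithmetic, using only the accuracies reported above (the function names are illustrative):

```python
def relative_improvement(baseline: float, optimized: float) -> float:
    """Relative gain over the baseline, in percent."""
    return (optimized - baseline) / baseline * 100.0


def point_gain(baseline: float, optimized: float) -> float:
    """Absolute gain, in percentage points."""
    return optimized - baseline


# (baseline accuracy %, optimized accuracy %) as reported for GPT-5
results = {
    "TCGA cancer classification": (15.1, 39.5),
    "GTEx organ classification": (38.1, 62.9),
}

for task, (baseline, optimized) in results.items():
    print(
        f"{task}: +{point_gain(baseline, optimized):.1f} points, "
        f"{relative_improvement(baseline, optimized):.0f}% relative"
    )
# TCGA cancer classification: +24.4 points, 162% relative
# GTEx organ classification: +24.8 points, 65% relative
```

Both tasks gain roughly 24 to 25 percentage points; the relative figure is much larger for TCGA only because its baseline is lower.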

Section 05

Analysis of Why Traditional Configurations Fail

The traditional configuration (small tiles, high magnification, independent processing) has three weaknesses. The high-magnification trap: the model sees only cellular detail, loses global context, and is limited in the number of tiles it can cover. Small-tile limitations: information is fragmented, spatial relationships are lost, and majority voting is a crude way to aggregate. Independent-processing flaws: the model cannot reason across tiles, processes redundant information, and has difficulty resolving contradictory tile predictions.
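
The loss of global context can be made concrete by estimating how much tissue one tile covers. The sketch below assumes a scanner resolution of about 0.25 microns per pixel at 40x, and the tile sizes (224 and 1024 pixels) and magnifications (40x and 5x) are illustrative values chosen here; none of them are taken from the study.

```python
def tile_coverage_mm2(
    tile_size_px: int,
    magnification: float,
    mpp_at_40x: float = 0.25,  # assumed microns per pixel at 40x
) -> float:
    """Approximate tissue area (mm^2) covered by one square tile."""
    # Microns per pixel scale inversely with magnification.
    mpp = mpp_at_40x * 40.0 / magnification
    edge_mm = tile_size_px * mpp / 1000.0
    return edge_mm ** 2


small_high = tile_coverage_mm2(224, 40.0)   # small tile, high magnification
large_low = tile_coverage_mm2(1024, 5.0)    # large tile, low magnification

print(f"small/high: {small_high:.6f} mm^2")
print(f"large/low:  {large_low:.3f} mm^2")
print(f"area ratio: {large_low / small_high:.0f}x")
```

Under these assumed values a single large, low-magnification tile covers over a thousand times the tissue area of a small, high-magnification tile, which is why a fixed tile budget sees far less of the slide in the traditional setup.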

Section 06

Implications for Pathology AI Research

1. Reassess specialized models: current benchmarks underestimate general-purpose models; specialized models still have value, but their advantage may lie in stability.
2. Improve benchmarking: standardize configuration reporting, evaluate under multiple configurations, and run ablation studies.
3. Cross-domain implications: the findings may apply to other medical imaging fields such as radiology and dermatology, as well as to long-document processing and multimodal tasks.

Section 07

Practical Recommendations

Researchers: report complete configuration details, run configuration ablations, and compare baseline against optimized configurations. Developers: do not follow convention blindly; experiment with different configurations and make use of the model's ability to integrate information across tiles. Clinical deployment: standardize configurations, monitor continuously, and combine multiple configurations to improve robustness.
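
One way to make configuration reporting and ablation routine is to enumerate the full grid of input factors and log each configuration alongside its result. The sketch below is a minimal illustration: the `FACTORS` dictionary mirrors the four factors and two levels each named in this article, while `ablation_grid` is a hypothetical helper.

```python
import itertools
import json

# The four input factors discussed in the article, two levels each.
FACTORS = {
    "reasoning_mode": ["independent", "joint"],
    "tile_size": ["small", "large"],
    "magnification": ["high", "low"],
    "num_tiles": ["few", "many"],
}


def ablation_grid(factors: dict[str, list[str]]) -> list[dict[str, str]]:
    """Return every combination of factor levels as a config dict."""
    names = list(factors)
    return [
        dict(zip(names, values))
        for values in itertools.product(*factors.values())
    ]


grid = ablation_grid(FACTORS)
print(len(grid))  # 16 configurations (2 levels for each of 4 factors)

# Serialize one configuration so it can be reported with its score.
print(json.dumps(grid[0], indent=2))
```

Publishing the serialized configuration next to each accuracy figure lets readers tell whether two benchmark numbers were obtained under comparable inputs.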

Section 08

Limitations and Future Directions

Current limitations: task scope (classification only), model scope (limited to GPT, Gemini, and Claude models), high computational cost, and limited interpretability. Future directions: adaptive configurations, hierarchical processing (a low-magnification global view combined with high-magnification detail), attention-guided tile selection, configuration transfer, and theoretical analysis from an information-theoretic perspective.