# ELTLM-Bench: A New Benchmark for Evaluating Temporal Multimodal Large Models in Healthcare

> This article introduces the ELTLM-Bench project, the first comprehensive benchmark for evaluating the time perception and temporal reasoning capabilities of multimodal large language models in longitudinal healthcare scenarios; the work has been accepted to ACL 2026 Findings.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-17T12:43:46.000Z
- Last activity: 2026-04-17T12:57:36.833Z
- Popularity: 159.8
- Keywords: ELTLM, Medical AI, Multimodal Models, Temporal Evaluation, MIMIC-CXR, Clinical Scenarios, ACL 2026, Benchmarking
- Page link: https://www.zingnex.cn/en/forum/thread/eltlm-bench
- Canonical: https://www.zingnex.cn/forum/thread/eltlm-bench
- Markdown source: floors_fallback

---

## [Introduction] ELTLM-Bench: A New Benchmark for Evaluating Temporal Multimodal Large Models in Healthcare

ELTLM-Bench is the first comprehensive benchmark for evaluating the time perception and temporal reasoning capabilities of multimodal large language models in longitudinal healthcare scenarios, and it has been accepted to ACL 2026 Findings. The benchmark addresses a gap left by traditional static evaluations, which ignore the temporal dimension of healthcare data. It provides a high-quality temporal dataset and a hierarchical evaluation system, and it reveals key limitations of current SOTA models in temporal understanding, making it an important evaluation tool for the development of healthcare AI.

## Research Background and Motivation: Temporal Challenges in Healthcare AI

Clinical diagnosis relies on dynamic temporal information, such as comparing imaging findings across different time points. Current mainstream evaluation benchmarks for healthcare multimodal models, however, are dominated by static evaluations, lack a temporal dimension, and fall short of clinical authenticity. Longitudinal temporal evaluation requires models to demonstrate time perception, change detection, trend reasoning, and causal association, capabilities that are crucial for scenarios such as chronic disease management.

## Core Contributions of ELTLM-Bench

1. **High-quality temporal dataset**: Built on MIMIC-CXR with strict case screening, temporal alignment, and clinical validation, in compliance with privacy regulations.
2. **Hierarchical evaluation system**: Level 1 (temporal difference question answering) tests basic perception; Level 2 (temporal reasoning question answering) tests advanced reasoning.
3. **In-depth insights**: Reveals SOTA model limitations such as insufficient temporal attention, broken reasoning chains, and difficulty modeling long-term temporal dependencies.
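To make the two-level structure concrete, here is a minimal sketch of how a benchmark item might be represented. The field names (`study_pair`, `days_between`, `level`, etc.) are illustrative assumptions, not the official ELTLM-Bench schema.

```python
# Hypothetical sketch of a two-level temporal QA item; field names
# are assumptions for illustration, not the official schema.
from dataclasses import dataclass
from typing import Literal

@dataclass
class TemporalQAItem:
    study_pair: tuple[str, str]  # (earlier study ID, later study ID)
    days_between: int            # temporal gap between the two studies
    level: Literal["difference", "reasoning"]  # Level 1 vs. Level 2
    question: str
    answer: str

# A Level-1 (temporal difference) item: basic change perception.
item = TemporalQAItem(
    study_pair=("s001", "s002"),
    days_between=14,
    level="difference",
    question="Has the pleural effusion increased since the prior study?",
    answer="Yes",
)
```

A Level-2 item would use `level="reasoning"` and pose a question requiring trend or causal inference across the pair rather than a single visible difference.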

## Technical Implementation Details

- **Data construction pipeline**: Case screening → time window definition → pair generation → question generation → expert validation.
- **Evaluation metrics**: Accuracy (accuracy rate/F1), temporal sensitivity (alignment accuracy), reasoning quality (step completeness), and clinical relevance (expert scoring).
- **Model testing**: Supports zero-shot prompting, few-shot prompting, and chain-of-thought evaluation.
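The accuracy and F1 metrics above can be sketched in a few lines. This is a minimal, self-contained illustration of standard accuracy and macro-F1 over categorical answers; it is not the benchmark's official scoring code, and the example labels are invented.

```python
# Minimal sketch of accuracy and macro-F1 over categorical answers.
# Example predictions/golds are invented for illustration.

def accuracy(preds, golds):
    """Fraction of exact matches between predictions and gold answers."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def macro_f1(preds, golds):
    """Unweighted mean of per-label F1 scores."""
    labels = set(golds) | set(preds)
    f1s = []
    for lab in labels:
        tp = sum(p == lab and g == lab for p, g in zip(preds, golds))
        fp = sum(p == lab and g != lab for p, g in zip(preds, golds))
        fn = sum(p != lab and g == lab for p, g in zip(preds, golds))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

preds = ["worse", "stable", "worse", "improved"]
golds = ["worse", "worse", "worse", "improved"]
print(accuracy(preds, golds))  # 0.75
```

Temporal sensitivity (alignment accuracy) could be computed the same way, with "did the model order the two studies correctly" as the binary label; reasoning quality and clinical relevance require human or expert scoring and are not reducible to a formula like this.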

## Experimental Results and Key Findings

- **Model performance**: Temporal tasks are significantly harder than static tasks (accuracy is 15-20% lower); reasoning tasks are the weak link; and model size correlates non-linearly with temporal ability.
- **Error patterns**: Confusing temporal order, over-focusing on the current state, hallucinated associations, and reasoning jumps.

## Data Access and Usage Guidelines

The ELTLM-Bench dataset is released on Hugging Face (https://huggingface.co/datasets/Chengfeng233/ELTLM-Bench). Access requires PhysioNet credentialing and CITI training, and the data are for research use only. Because the benchmark is derived from MIMIC-CXR, separate access to MIMIC-CXR (https://physionet.org/content/mimic-cxr/2.1.0/) is also required.
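The access prerequisites can be captured as a simple checklist before attempting a download. The checklist fields and the `ready_to_download` helper below are hypothetical conveniences, not part of any official tooling; only the dataset ID comes from the post, and the actual download would still require Hugging Face authentication for a gated dataset.

```python
# Hypothetical checklist mirroring the stated access requirements.
# Only DATASET_ID comes from the post; everything else is illustrative.
from dataclasses import dataclass

DATASET_ID = "Chengfeng233/ELTLM-Bench"

@dataclass
class AccessPrereqs:
    physionet_credentialed: bool  # approved PhysioNet account
    citi_training_done: bool      # completed CITI training
    mimic_cxr_access: bool        # separate MIMIC-CXR approval

def ready_to_download(p: AccessPrereqs) -> bool:
    """All three prerequisites must hold before requesting the data."""
    return p.physionet_credentialed and p.citi_training_done and p.mimic_cxr_access

if ready_to_download(AccessPrereqs(True, True, True)):
    # With `pip install datasets` and a Hugging Face auth token, the
    # gated dataset could then be fetched (network access required):
    # from datasets import load_dataset
    # ds = load_dataset(DATASET_ID)
    pass
```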

## Academic Contributions and Future Directions

The paper has been accepted by ACL 2026 Findings, filling the gap in evaluation, promoting research progress, and facilitating clinical collaboration. Future plans include expanding the dataset (more modalities/diseases/temporal spans/multi-center), enriching evaluation dimensions (uncertainty, interpretability, etc.), and continuously maintaining the benchmark.

## Summary and Implications for Healthcare AI Development

ELTLM-Bench is a milestone in healthcare AI evaluation, revealing model limitations and guiding directions. Implications: Healthcare AI evaluation needs to shift to dynamic temporal aspects, task design should be close to clinical practice, and interdisciplinary collaboration is crucial. The project's open-source nature and ethical norms set an example for the industry.
