# TreeDDx: Evaluating Large Language Models' Differential Diagnosis Reasoning Ability Using Structured Clinical Decision Trees

> TreeDDx is a benchmark framework for large language models (LLMs) on clinical differential diagnosis tasks, evaluating models' reasoning ability and diagnostic accuracy via structured decision trees.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-03T14:06:35.000Z
- Last activity: 2026-06-03T14:49:33.012Z
- Popularity: 157.3
- Keywords: large language models, medical AI, differential diagnosis, clinical decision trees, benchmarking, reasoning evaluation, medical NLP
- Page link: https://www.zingnex.cn/en/forum/thread/treeddx
- Canonical: https://www.zingnex.cn/forum/thread/treeddx

---

## Introduction: TreeDDx – A New Framework for Evaluating LLMs' Clinical Differential Diagnosis Reasoning Ability

TreeDDx scores LLMs on clinical differential diagnosis by comparing the decision trees they produce against expert-annotated ones. Existing medical benchmarks struggle to assess the reasoning chains behind complex clinical decisions; TreeDDx aims to make an LLM's reasoning process traceable and quantifiable.

## Background: Reasoning Challenges in Differential Diagnosis for Medical AI

LLMs perform well on medical question answering and knowledge retrieval, but differential diagnosis in real clinical settings demands structured reasoning: narrowing a set of candidate diseases down to the most likely diagnosis. Existing benchmarks mostly focus on single-turn question answering or knowledge recall, which makes complex reasoning chains hard to evaluate. TreeDDx was created to fill this gap by introducing an evaluation paradigm based on clinical decision trees.

## Core Design of TreeDDx: Decision Tree Generation and Matching

TreeDDx formalizes differential diagnosis as a decision tree generation and matching problem:

1. Structured decision tree representation: nodes are diagnostic hypotheses, and edges are supporting or excluding evidence.
2. Comparison of model outputs with expert-annotated ground-truth decision trees.
3. Multi-dimensional evaluation: correctness of the final diagnosis, soundness of the reasoning path, coverage of key nodes, and completeness of the logical chain.

This design can capture reasoning flaws that traditional metrics miss, such as reaching the correct diagnosis by guesswork while skipping reasoning steps.
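The tree representation and the key-node coverage dimension described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the benchmark's own code: the `Node` and `Edge` classes, the `supports`/`excludes` labels, the exact-string matching of hypotheses, and the toy chest-pain case are all invented for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Edge:
    """Evidence linking a parent hypothesis to a child hypothesis."""
    evidence: str
    kind: str  # "supports" or "excludes" (labels assumed for this sketch)
    child: Node


@dataclass
class Node:
    """A diagnostic hypothesis in the decision tree."""
    hypothesis: str
    edges: list[Edge] = field(default_factory=list)

    def add(self, evidence: str, kind: str, hypothesis: str) -> Node:
        """Attach a child hypothesis reached via one piece of evidence."""
        child = Node(hypothesis)
        self.edges.append(Edge(evidence, kind, child))
        return child


def hypotheses(root: Node) -> set[str]:
    """Collect every hypothesis in the tree, lower-cased for comparison."""
    found, stack = set(), [root]
    while stack:
        node = stack.pop()
        found.add(node.hypothesis.lower())
        stack.extend(edge.child for edge in node.edges)
    return found


def key_node_coverage(predicted: Node, reference: Node) -> float:
    """Fraction of reference hypotheses that also appear in the predicted tree."""
    ref = hypotheses(reference)
    return len(ref & hypotheses(predicted)) / len(ref)


# Toy reference tree (illustrative only, not taken from the dataset).
reference = Node("chest pain")
reference.add("elevated D-dimer", "supports", "pulmonary embolism")
reference.add("normal troponin", "excludes", "myocardial infarction")

# Toy model output that never considers one of the reference hypotheses.
predicted = Node("chest pain")
predicted.add("elevated D-dimer", "supports", "pulmonary embolism")

coverage = key_node_coverage(predicted, reference)  # 2 of 3 reference nodes
```

Exact string matching is the simplest possible choice here. A real evaluation would need a softer comparison, since two clinicians rarely phrase the same hypothesis identically.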

## Dataset and Experimental Components

TreeDDx draws its data from real, challenging cases in the JAMA Network Clinical Challenge and provides preprocessing scripts that convert them into decision tree samples. Key components:

- gt_decisiontree_generation.py: generates the ground-truth decision trees.
- llm_decisiontree_generation.py: calls LLMs to generate decision trees.
- evaluation.py: computes similarity and runs the multi-dimensional evaluation.

Note: users must obtain authorization for the original JAMA cases themselves.

## Technical Core: Decision Tree Similarity Evaluation Method

Decision tree similarity evaluation is the technical core. It combines graph edit distance at the structural level with node semantic similarity computed by medical pre-trained models. The structural level measures topological similarity (node hierarchy and branching), the semantic level measures the similarity of node text, and a weighted combination of the two gives the final matching score. This characterizes differences in reasoning ability at a finer grain than traditional methods.
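The weighted combination can be illustrated with a small, dependency-free Python sketch. It deliberately swaps in simpler stand-ins for both components: a per-depth node-count overlap replaces graph edit distance, and token-level Jaccard overlap replaces embeddings from a medical pre-trained model. The tuple-based tree format, the function names, and the weight `alpha` are assumptions for this example, not the interface of evaluation.py.

```python
# A tree is a (hypothesis_text, [child_trees]) tuple in this sketch.


def level_profile(tree, depth=0, profile=None):
    """Count nodes at each depth: a crude summary of hierarchy and branching."""
    if profile is None:
        profile = []
    if depth == len(profile):
        profile.append(0)
    profile[depth] += 1
    for child in tree[1]:
        level_profile(child, depth + 1, profile)
    return profile


def structural_similarity(a, b):
    """Overlap of per-depth node counts (stand-in for graph edit distance)."""
    pa, pb = level_profile(a), level_profile(b)
    width = max(len(pa), len(pb))
    pa += [0] * (width - len(pa))
    pb += [0] * (width - len(pb))
    overlap = sum(min(x, y) for x, y in zip(pa, pb))
    total = sum(max(x, y) for x, y in zip(pa, pb))
    return overlap / total


def labels(tree):
    """Flatten a tree into the list of its node texts."""
    out = [tree[0]]
    for child in tree[1]:
        out.extend(labels(child))
    return out


def token_jaccard(s, t):
    """Token-set overlap (stand-in for embedding-based semantic similarity)."""
    a, b = set(s.lower().split()), set(t.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0


def semantic_similarity(pred, ref):
    """Average, over reference nodes, of the best-matching predicted node."""
    pred_labels, ref_labels = labels(pred), labels(ref)
    best = [max(token_jaccard(r, p) for p in pred_labels) for r in ref_labels]
    return sum(best) / len(best)


def tree_match_score(pred, ref, alpha=0.5):
    """Weighted combination of structural and semantic similarity."""
    return (alpha * structural_similarity(pred, ref)
            + (1 - alpha) * semantic_similarity(pred, ref))


# Toy trees (illustrative only, not taken from the dataset).
ref_tree = ("chest pain", [("pulmonary embolism", []),
                           ("myocardial infarction", [])])
pred_tree = ("chest pain", [("pulmonary embolism", [])])

score = tree_match_score(pred_tree, ref_tree)
```

With `alpha=0.5` the two levels count equally. Raising `alpha` rewards trees with the right shape, while lowering it rewards trees that name the right hypotheses regardless of where they sit.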

## Application Value: Implications for Medical AI Development

TreeDDx offers three things to medical AI development:

1. Interpretability: decision trees are naturally interpretable, which puts the explainability of medical AI front and center.
2. Fine-grained diagnosis of model flaws: weaknesses can be located in specific diseases or specific reasoning links.
3. Versatility: the approach is transferable to other complex reasoning fields such as law and engineering.

## Limitations and Future Improvement Directions

TreeDDx has three main limitations. It relies on the in-context learning ability of LLMs, it may generate incomplete decision trees for challenging cases, and its ground-truth annotation takes substantial expert time, which limits dataset size. Future directions include using stronger LLMs to generate decision trees, developing semi-automated annotation tools, and combining the benchmark with RLHF to optimize reasoning ability.

## Conclusion: Significance and Paradigm Reference of TreeDDx

TreeDDx advances medical AI evaluation by formalizing the complex clinical cognitive process of differential diagnosis into a computable, comparable structured task. For researchers and developers of medical LLMs, it serves both as a benchmark tool and as a reference paradigm for evaluating and improving models' clinical reasoning ability.
