Zing Forum

RareScopeDB: Identifying Knowledge Gaps and Confident Hallucinations of Large Language Models in the Rare Disease Domain Using Perplexity and Disease Knowledge Score

RareScopeDB systematically evaluates the knowledge status of large language models (LLMs) in the rare disease domain by combining Perplexity (PPL) and Disease Knowledge Score (DKS). It classifies disease-level knowledge into four categories: stable knowledge, knowledge gaps, confident hallucinations, and unstable knowledge, providing a crucial diagnostic tool for the safe application of medical AI.

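Tags: large language models, rare diseases, medical AI, hallucination detection, knowledge blind spots, perplexity, HPO ontology, machine learning evaluation, AI safety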
Published 2026-05-13 13:21 | Recent activity 2026-05-13 13:31 | Estimated read 7 min

Section 01

Introduction: RareScopeDB—A Tool for Identifying Knowledge Gaps and Hallucinations of LLMs in Rare Diseases

RareScopeDB builds on Human Phenotype Ontology (HPO) associated data for 9171 rare diseases. The project releases its datasets and toolchain as open source and provides an online browser for practical use.


Section 02

Background and Motivation: Challenges of LLM Applications in the Rare Disease Domain

Large language models (LLMs) are widely used in medicine, but rare diseases, with their scarce data and complex knowledge, are a critical test of reliability. There are over 7000 rare diseases globally, and the very small number of cases for each means that training data is sparse. When a model is consulted about such a disease, two risks arise: confident but incorrect answers (confident hallucinations) and uncertain expression of knowledge the model actually has (unstable knowledge). Either can delay diagnosis. RareScopeDB aims to quantify LLM knowledge gaps in the rare disease domain and to provide a basis for the safe deployment of medical AI, using an evaluation framework built on HPO-associated data for 9171 rare diseases.


Section 03

Core Methodology: Dual-Indicator Evaluation System and Four Knowledge States

RareScopeDB combines two indicators:

  • Perplexity (PPL): Measures the model's uncertainty in predicting tokens; a high PPL indicates high uncertainty when the model generates information about the disease.
  • Disease Knowledge Score (DKS): Quantifies the consistency between the content generated by the model and the reference knowledge from HPO.

Cross-analysis of the two indicators yields four knowledge states:

  • Stable Knowledge: Low PPL + High DKS. The model has sufficient knowledge and high confidence.
  • Knowledge Gaps: High PPL + Low DKS. The model lacks knowledge and its uncertainty reflects that.
  • Confident Hallucinations: Low PPL + Low DKS. The model lacks knowledge but answers confidently (the most dangerous state).
  • Unstable Knowledge: High PPL + High DKS. The model possesses the knowledge but expresses it with uncertainty.
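
The two-indicator cross-analysis can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's implementation: perplexity is computed with the standard definition (the exponential of the negative mean log-probability per token), and the low/high boundary for each indicator is assumed to be the median across diseases, since the actual cut-offs used by RareScopeDB are not given here. The function names, disease labels, and numbers are hypothetical.

```python
import math
from statistics import median


def perplexity(token_logprobs):
    """Perplexity: exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def knowledge_state(ppl, dks, ppl_cut, dks_cut):
    """Assign one of the four states. The cut-offs are assumptions (see above)."""
    low_ppl = ppl <= ppl_cut
    high_dks = dks >= dks_cut
    if low_ppl and high_dks:
        return "stable knowledge"
    if low_ppl:
        return "confident hallucination"
    if high_dks:
        return "unstable knowledge"
    return "knowledge gap"


# Hypothetical toy data: disease label -> (per-token log-probs, DKS in [0, 1]).
toy = {
    "disease_A": ([-0.1, -0.2, -0.1], 0.9),
    "disease_B": ([-0.2, -0.1, -0.2], 0.1),
    "disease_C": ([-2.0, -1.5, -2.5], 0.8),
    "disease_D": ([-2.2, -1.8, -2.0], 0.2),
}

ppl_by_disease = {name: perplexity(lp) for name, (lp, _) in toy.items()}

# Assumed thresholds: median split on each indicator.
ppl_cut = median(ppl_by_disease.values())
dks_cut = median(dks for _, dks in toy.values())

states = {
    name: knowledge_state(ppl_by_disease[name], dks, ppl_cut, dks_cut)
    for name, (_, dks) in toy.items()
}

for name, state in states.items():
    print(f"{name}: PPL={ppl_by_disease[name]:.2f} -> {state}")
```

A median split is only one possible choice; any percentile-based cut-off plugs into `knowledge_state` the same way.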

Section 04

Datasets and Toolchain: Open-Source Resources Supporting Evaluation Work

Core Datasets

  • RareScopeDB.xlsx: Contains analysis tables for 9171 rare diseases, including disease identifiers, names, reference information, PPL/DKS percentiles, and knowledge state classifications.
  • qwen3.6-35b-a3b_raw.xlsx: Raw prompt and output data for structured knowledge evaluation.
  • Downstream Diagnosis Question Sets: FGDD and RareBench (including subsets like HMS) are used for evaluation in real diagnostic scenarios.

Analysis Tools

A complete Jupyter Notebook workflow is provided:

  • perplexity_pipeline.ipynb: Model query and token probability collection.
  • phenotype_tool.ipynb: Standardization of HPO phenotypic terms.
  • results_analyze1.ipynb: Calculation of performance metrics, DKS computation, knowledge state assignment, and downstream diagnostic analysis.
  • Diagnosis.ipynb: Downstream evaluation of rare disease diagnostic reasoning.
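
To make the phenotype standardization and scoring steps more concrete, here is a hedged sketch of how model-generated phenotype terms might be compared with a reference HPO annotation. The exact DKS formula is not described here, so the F1 overlap below is an illustrative stand-in and should not be read as the project's metric. The function names and the example identifiers are hypothetical; the only assumption about HPO itself is its identifier format, `HP:` followed by seven digits.

```python
import re

# HPO identifiers have the form "HP:" followed by seven digits.
HPO_ID = re.compile(r"HP:\d{7}")


def normalize_hpo_ids(raw_terms):
    """Trim and upper-case each term, keeping only well-formed HPO identifiers."""
    cleaned = set()
    for term in raw_terms:
        candidate = term.strip().upper()
        if HPO_ID.fullmatch(candidate):
            cleaned.add(candidate)
    return cleaned


def overlap_score(generated, reference):
    """F1 overlap between two sets of HPO IDs (a stand-in, not the DKS formula)."""
    gen = normalize_hpo_ids(generated)
    ref = normalize_hpo_ids(reference)
    hits = len(gen & ref)
    if hits == 0:
        return 0.0
    precision = hits / len(gen)
    recall = hits / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the free-text entry is dropped during normalization.
generated = [" hp:0001250", "HP:0001263", "seizures"]
reference = ["HP:0001250", "HP:0001263", "HP:0000252", "HP:0001249"]

score = overlap_score(generated, reference)
print(f"overlap score: {score:.3f}")
```

Dropping terms that do not parse as HPO identifiers keeps the comparison strict; a real pipeline would more likely map free-text phenotype names to HPO terms before scoring.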

Section 05

Practical Application: Online Browser Facilitating Multi-Role Usage

RareScopeDB provides an online browser (https://bioinf.org.cn:8055/) where users can query the knowledge status of specific rare diseases, the quality of phenotype generation, and the accuracy of gene associations. This tool is valuable for the following roles:

  • Model Developers: Identify systematic defects in models and guide improvements.
  • Medical AI Product Managers: Understand the boundary of model capabilities and design human-machine collaboration processes.
  • Clinicians: Evaluate the credibility of AI-assisted diagnosis and decide when to seek a second opinion.

Section 06

Research Significance and Future Outlook: An Important Paradigm for Medical AI Safety

RareScopeDB offers a paradigm for the interpretability and safety evaluation of medical AI. It shows that even advanced LLMs still have significant knowledge gaps in the rare disease domain. The most serious is 'confident hallucination' (incorrect but credible-sounding output), which has direct consequences for clinical deployment and points to the need for confidence calibration mechanisms and clear boundaries between human and machine responsibility. In the future, the framework could be extended to fields such as rare tumors and inherited metabolic diseases, and its findings suggest directions for model improvement such as retrieval-augmented generation and domain-adaptive training.