RareScopeDB: Identifying Knowledge Gaps and Confident Hallucinations of Large Language Models in the Rare Disease Domain Using Perplexity and Disease Knowledge Score

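Tags: large language models, rare diseases, medical AI, hallucination detection, knowledge gaps, perplexity, HPO ontology, machine learning evaluation, AI safety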

Published 2026-05-13 13:21 | Recent activity 2026-05-13 13:31 | Estimated read 7 min

Section 01

Introduction: RareScopeDB—A Tool for Identifying Knowledge Gaps and Hallucinations of LLMs in Rare Diseases

RareScopeDB systematically evaluates what large language models (LLMs) know about rare diseases by combining two indicators: Perplexity (PPL) and Disease Knowledge Score (DKS). It classifies disease-level knowledge into four categories (stable knowledge, knowledge gaps, confident hallucinations, and unstable knowledge), giving a diagnostic tool for the safe application of medical AI. The project builds on Human Phenotype Ontology (HPO) associated data for 9171 rare diseases, releases open-source datasets and a toolchain, and provides an online browser for practical use.

Section 02

Background and Motivation: Challenges of LLM Applications in the Rare Disease Domain

Large language models (LLMs) are widely used in the medical field, but rare diseases, with their scarce data and complex knowledge, are a critical test of model reliability. There are over 7000 rare diseases globally, and the extremely low number of cases for each one leaves models with insufficient training data. Two risks arise when a model is consulted about such diseases: answers that are confident but incorrect (confident hallucinations), and uncertain expression of knowledge the model actually has (unstable knowledge). Either can delay diagnosis. RareScopeDB aims to quantify the knowledge gaps of LLMs in the rare disease domain and to provide a basis for the safe deployment of medical AI, using an evaluation framework built on HPO-associated data for 9171 rare diseases.

Section 03

Core Methodology: Dual-Indicator Evaluation System and Four Knowledge States

RareScopeDB combines two indicators: Perplexity (PPL) and Disease Knowledge Score (DKS):

Perplexity (PPL): Measures the uncertainty of the model in predicting tokens; a high PPL indicates high uncertainty when the model generates information about the disease.
Disease Knowledge Score (DKS): Quantifies the consistency between the content generated by the model and the standard knowledge from HPO.

Through cross-analysis, four knowledge states are classified:

Stable Knowledge: Low PPL + High DKS—The model has sufficient knowledge and high confidence.
Knowledge Gaps: High PPL + Low DKS—The model lacks knowledge and recognizes the uncertainty.
Confident Hallucinations: Low PPL + Low DKS—The model lacks knowledge but answers confidently (most dangerous).
Unstable Knowledge: High PPL + High DKS—The model possesses knowledge but expresses it with uncertainty.
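
The cross-analysis above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: it uses the standard definition of perplexity (the exponential of the negative mean token log-probability), and it assumes that "low" and "high" are decided by comparing PPL and DKS percentiles against a single cutoff. The function names, the 50th-percentile default cutoff, and the tie-breaking rule are all hypothetical choices made for the example.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Standard perplexity: exp of the negative mean natural-log token probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def knowledge_state(ppl_pct: float, dks_pct: float, cutoff: float = 50.0) -> str:
    """Map PPL and DKS percentiles (0-100) to one of the four knowledge states.

    Assumption: a single percentile cutoff separates "low" from "high".
    The cutoff value and the handling of ties are illustrative only.
    """
    low_ppl = ppl_pct < cutoff      # model is confident
    high_dks = dks_pct >= cutoff    # output agrees with HPO reference knowledge
    if low_ppl and high_dks:
        return "stable knowledge"
    if low_ppl and not high_dks:
        return "confident hallucination"
    if not low_ppl and high_dks:
        return "unstable knowledge"
    return "knowledge gap"


if __name__ == "__main__":
    # Four tokens, each predicted with probability 0.5
    ppl = perplexity([math.log(0.5)] * 4)
    print(f"PPL = {ppl:.3f}")
    print(knowledge_state(ppl_pct=10.0, dks_pct=20.0))
```

A model that assigns probability 0.5 to every token has a perplexity of 2, which matches the intuition that it is choosing between two equally likely options at each step. The second call shows the most dangerous quadrant: a low PPL percentile paired with a low DKS percentile.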

Section 04

Datasets and Toolchain: Open-Source Resources Supporting Evaluation Work

Core Datasets

RareScopeDB.xlsx: Contains analysis tables for 9171 rare diseases, including disease identifiers, names, reference information, PPL/DKS percentiles, and knowledge state classifications.
qwen3.6-35b-a3b_raw.xlsx: Raw prompt and output data for structured knowledge evaluation.
Downstream Diagnosis Question Sets: FGDD and RareBench (including subsets like HMS) are used for evaluation in real diagnostic scenarios.

Analysis Tools

A complete Jupyter Notebook workflow is provided:

perplexity_pipeline.ipynb: Model query and token probability collection.
phenotype_tool.ipynb: Standardization of HPO phenotypic terms.
results_analyze1.ipynb: Calculation of performance metrics, DKS computation, knowledge state assignment, and downstream diagnostic analysis.
Diagnosis.ipynb: Downstream evaluation of rare disease diagnostic reasoning.
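
The notebooks themselves are not reproduced here, and the exact DKS formula is not given in this summary. As a rough illustration of the kind of computation the analysis step describes, the sketch below assumes a stand-in score: the F1 overlap between the HPO term IDs a model generates for a disease and the reference HPO terms for that disease. It then converts raw scores into percentile ranks across diseases, the form in which RareScopeDB.xlsx is described as reporting PPL and DKS. The function names, the F1 choice, and the percentile convention are assumptions, not the project's actual method.

```python
from bisect import bisect_right
from typing import Dict, Iterable


def overlap_f1(generated: Iterable[str], reference: Iterable[str]) -> float:
    """F1 overlap between generated and reference HPO term IDs.

    Stand-in for a knowledge score; the real DKS may be defined differently.
    """
    gen, ref = set(generated), set(reference)
    true_pos = len(gen & ref)
    if true_pos == 0:
        return 0.0  # also covers empty inputs
    precision = true_pos / len(gen)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)


def percentile_ranks(scores: Dict[str, float]) -> Dict[str, float]:
    """Percentage of diseases whose score is <= each disease's score."""
    ordered = sorted(scores.values())
    total = len(ordered)
    return {
        disease: 100.0 * bisect_right(ordered, value) / total
        for disease, value in scores.items()
    }


if __name__ == "__main__":
    # Illustrative HPO-style identifiers; one of two generated terms is correct
    score = overlap_f1(
        ["HP:0001250", "HP:0001263"],
        ["HP:0001250", "HP:0004322"],
    )
    print(f"overlap F1 = {score:.2f}")

    # Hypothetical disease keys and scores
    print(percentile_ranks({"d1": 0.1, "d2": 0.5, "d3": 0.7, "d4": 0.9}))
```

Working in percentiles rather than raw values makes the two indicators comparable across diseases even though PPL and an overlap score live on very different scales.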

Section 05

Practical Application: Online Browser Facilitating Multi-Role Usage

RareScopeDB provides an online browser (https://bioinf.org.cn:8055/) where users can query the knowledge status of specific rare diseases, the quality of phenotype generation, and the accuracy of gene associations. This tool is valuable for the following roles:

Model Developers: Identify systematic defects in models and guide improvements.
Medical AI Product Managers: Understand the boundary of model capabilities and design human-machine collaboration processes.
Clinicians: Evaluate the credibility of AI-assisted diagnosis and decide when to seek a second opinion.

Section 06

Research Significance and Future Outlook: An Important Paradigm for Medical AI Safety

RareScopeDB provides a paradigm for the interpretability and safety evaluation of medical AI. It shows that even advanced LLMs still have significant knowledge gaps in the rare disease domain. The most serious problem is "confident hallucinations", incorrect outputs that appear credible, which has direct consequences for clinical deployment and points to the need for confidence calibration mechanisms and clear boundaries of human-machine responsibility. In the future, the approach could be extended to fields such as rare tumors and genetic metabolic diseases, and its findings can guide model improvements such as retrieval-augmented generation and domain-adaptive training.
