ClinicalDx-NLP: A Medical AI Dataset for Converting Discharge Records into Structured ICD-10 Codes

A medical NLP dataset containing 50,000 synthetic discharge summaries, covering ICD-10 codes, CPT codes, DRG codes, and 6 categories of NER annotations, designed specifically for clinical NLP, medical coding AI, and large language model fine-tuning.

Tags: Medical NLP, ICD-10 Coding, Clinical Text Mining, Named Entity Recognition, Medical Dataset, LLM Fine-tuning, Medical AI
Published 2026-04-30 18:14 | Recent activity 2026-04-30 18:18 | Estimated read 7 min

Section 01

ClinicalDx-NLP Dataset Guide: Addressing Data Pain Points in Medical Coding and AI Research

ClinicalDx-NLP is a medical NLP dataset of 50,000 synthetic discharge summaries annotated with ICD-10, CPT, and DRG codes plus 6 categories of NER labels, built for clinical NLP, medical coding AI, and large language model fine-tuning. It targets three pain points: the high error rate of manual medical coding, the high barrier to obtaining quality clinical data, and the lack of datasets that carry both ICD-10 codes and NER annotations. Because the records are synthetic, the data is described as HIPAA-safe and can be used without a credentialing review.

Section 02

Challenges in Medical Coding and Limitations of Existing Data

Medical coding errors cost the U.S. healthcare system more than $25 billion a year, and roughly 22% of manually coded records contain errors, which makes miscoding the second leading cause of insurance claim denials. Existing resources do little to close the gap. Manual coding is slow and error-prone. Datasets such as MIMIC-III require a credentialing review that takes several weeks, which slows medical AI development. And there are few high-quality discharge summary datasets that carry both ICD-10 diagnostic codes and NER annotations, which limits research on end-to-end coding automation.

Section 03

Core Components and File Structure of the ClinicalDx-NLP Dataset

ClinicalDx-NLP contains 50,000 synthetic but realistic discharge summaries, described as HIPAA-safe because they are not real patient records. The core files are:

  • discharge_summaries.csv: 50,000 complete summaries with ICD-10, CPT, DRG codes, and demographic information
  • ner_annotations.jsonl: 50,000 entries with 6 categories of NER entity annotations
  • train_test_split.csv: 70/15/15 stratified train/validation/test split
  • icd10_reference.csv: Approximately 14 ICD-10 codes with specialty and DRG mappings
  • data_dictionary.csv: Data schema reference document
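
As a rough illustration of how these files fit together, the sketch below reads two of the CSVs and buckets the summaries by partition. It uses only the Python standard library. The `summary_id` column is named in the field list; the `split` column name and its values are assumptions, so check data_dictionary.csv for the actual schema before relying on them.

```python
import csv
from collections import defaultdict


def load_rows(csv_path):
    """Read a CSV file into a list of dicts keyed by column name."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def group_by_split(summaries, split_rows, id_col="summary_id", split_col="split"):
    """Bucket summary rows by the partition named in the split file.

    `split_col` is an assumed column name, not confirmed by the source.
    """
    split_of = {row[id_col]: row[split_col] for row in split_rows}
    buckets = defaultdict(list)
    for row in summaries:
        # Rows that do not appear in the split file land under "unassigned".
        buckets[split_of.get(row[id_col], "unassigned")].append(row)
    return dict(buckets)


# Example usage once the files are downloaded to the working directory:
#   summaries = load_rows("discharge_summaries.csv")
#   splits = load_rows("train_test_split.csv")
#   for name, rows in group_by_split(summaries, splits).items():
#       print(name, len(rows))
```

Keeping the partition assignment in a separate file means every experiment can reuse the same split, which makes reported scores comparable.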

Section 04

Data Field Design and Detailed Explanation of 6 NER Annotation Categories

Each discharge summary includes:

  • Basic Information: summary_id, admission and discharge dates, length of hospital stay, age, gender, etc.
  • Clinical Codes: primary and secondary ICD-10 codes, CPT procedure codes, DRG codes
  • Discharge Status: disposition (home, professional care, etc.) and condition status (stable, improved, etc.)
  • Text Content: a 200-400 word summary, its word count, and its number of NER entities

NER annotations consist of 6 categories: MEDICATION, PROCEDURE, DIAGNOSIS, LAB_VALUE, ANATOMY, TEMPORAL. Each annotation records the entity text, its type, and its character positions, so it can be used directly for training in NLP frameworks such as spaCy.
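
To make the annotation layout concrete, here is a minimal sketch that turns one annotation record into the `(text, {"entities": [(start, end, label)]})` tuple commonly used for spaCy NER training data. The record keys (`text`, `entities`, `type`, `start`, `end`), the end-exclusive offsets, and the sample sentence are all assumptions for illustration; the real ner_annotations.jsonl may name or nest these differently.

```python
import json

LABELS = {"MEDICATION", "PROCEDURE", "DIAGNOSIS", "LAB_VALUE", "ANATOMY", "TEMPORAL"}


def to_spacy_example(record):
    """Convert one annotation record to (text, {"entities": [(start, end, label)]})."""
    text = record["text"]
    spans = []
    for ent in record["entities"]:
        start, end, label = ent["start"], ent["end"], ent["type"]
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        # Check that the offsets really point at the annotated surface text.
        if text[start:end] != ent["text"]:
            raise ValueError(f"offset mismatch for {ent['text']!r}")
        spans.append((start, end, label))
    return text, {"entities": sorted(spans)}


# A made-up line in the assumed JSONL layout (not taken from the dataset).
line = json.dumps({
    "text": "Started metformin 500 mg on day 2.",
    "entities": [
        {"text": "metformin", "type": "MEDICATION", "start": 8, "end": 17},
        {"text": "day 2", "type": "TEMPORAL", "start": 28, "end": 33},
    ],
})
example = to_spacy_example(json.loads(line))
print(example)
```

Validating offsets at load time catches off-by-one errors early, before they surface as misaligned spans during training.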

Section 05

Strict Quality Checks Ensure Data Authenticity and Logical Consistency

Data quality assurance mechanisms:

  1. Specialty Locking: ICD-10 codes strictly correspond to specialties (e.g., no sepsis codes in obstetrics)
  2. Code Matching: CPT codes match diagnostic codes (e.g., no colonoscopy CPT codes for breast cancer patients)
  3. Clinical Logic: Medications/lab values/vital signs match diagnoses (e.g., sepsis patients have abnormal lactate levels and low blood pressure)
  4. Demographic Constraints: Age range (18-42 for obstetrics), gender distribution (100% female for obstetrics), length of stay calibrated by DRG weight

All data has passed consistency checks to ensure real-world logic.
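
Rules like these can be re-checked on the downloaded data. The sketch below covers only the obstetrics examples mentioned above; it is not the project's own validation code. The field names (`specialty`, `age`, `gender`, `primary_icd10`), the value encodings (`"obstetrics"`, `"F"`), and the use of ICD-10 categories A40 and A41 as the sepsis codes are assumptions made for illustration.

```python
SEPSIS_PREFIXES = ("A40", "A41")  # ICD-10 sepsis categories, used illustratively


def check_record(rec):
    """Return a list of human-readable rule violations for one record."""
    problems = []
    if rec["specialty"] == "obstetrics":
        # Demographic constraints stated for obstetrics.
        if not 18 <= int(rec["age"]) <= 42:
            problems.append("obstetrics age outside 18-42")
        if rec["gender"] != "F":
            problems.append("obstetrics patient not female")
        # Specialty locking: no sepsis codes in obstetrics.
        if rec["primary_icd10"].startswith(SEPSIS_PREFIXES):
            problems.append("sepsis code in obstetrics")
    return problems


ok = {"specialty": "obstetrics", "age": "29", "gender": "F", "primary_icd10": "O80"}
bad = {"specialty": "obstetrics", "age": "55", "gender": "M", "primary_icd10": "A41.9"}
print(check_record(ok))
print(check_record(bad))
```

Returning a list of violations rather than a boolean makes it easy to count which rule fails most often across the whole file.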

Section 06

Medical AI Application Scenarios Supported by the Dataset

The dataset supports multiple medical AI applications:

  • ICD-10 Code Prediction: End-to-end prediction using models like TF-IDF+logistic regression or BERT
  • NER Training: Train medical NER models based on the 6 annotation categories
  • LLM Fine-tuning: Build instruction pairs for fine-tuning open-source LLMs like Llama/Mistral
  • Length of Stay Prediction: Build models based on features like age and specialty to assist resource planning

Supporting visualization tools generate 5 academic charts (dataset overview, NER analysis, model analysis, etc.) to facilitate data understanding and ROI evaluation.
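
For the LLM fine-tuning use case, an instruction pair can be built from each row along the following lines. This is a sketch under assumptions: the column names `summary_text`, `primary_icd10`, and `secondary_icd10`, the semicolon-separated secondary codes, and the prompt wording are all hypothetical, and the sample row is invented rather than drawn from the dataset.

```python
import json

PROMPT = "Assign the ICD-10 codes for this discharge summary. List the primary code first."


def build_instruction_pair(row):
    """Turn one summary row into an instruction/input/output record."""
    secondary = [c.strip() for c in row.get("secondary_icd10", "").split(";") if c.strip()]
    codes = [row["primary_icd10"]] + secondary
    return {
        "instruction": PROMPT,
        "input": row["summary_text"],
        "output": ", ".join(codes),
    }


# Invented row in the assumed column layout.
row = {
    "summary_text": "Patient admitted with poorly controlled type 2 diabetes and hypertension.",
    "primary_icd10": "E11.9",
    "secondary_icd10": "I10",
}
pair = build_instruction_pair(row)
jsonl_line = json.dumps(pair)  # one line of a fine-tuning JSONL file
print(jsonl_line)
```

Writing one JSON object per line keeps the output compatible with most fine-tuning toolchains that read JSONL, though the exact key names each tool expects vary.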

Section 07

Dataset Access Methods and Project Significance Outlook

Dataset access methods:

  • GitHub Clone: run git clone https://github.com/NudratDS/ClinicalDx-NLP, then install the dependencies and run the generation script
  • Kaggle Direct Use: kaggle.com/datasets/nudratabbas/clinicaldx-nlp

Project Significance: The dataset fills a gap in medical NLP data, lowers the entry barrier for AI research, and lays groundwork for automated coding and clinical decision support. Going forward, it aims to help medical AI move from the laboratory into clinical practice and to serve as shared infrastructure for that work.