Zing Forum


DigitalRegistrar: Automatically Extracting Structured Medical Data from Pathology Reports Using Large Language Models

This article introduces the DigitalRegistrar project, a medical AI data processing pipeline that uses large language models to process pathology reports, automatically extract structured information, and convert it into JSON format.

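Tags: pathology reports, large language models, medical data, information extraction, NLP, medical AI, data structuring, tumor registry, clinical research, LLM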
Published 2026-05-23 19:14 | Recent activity 2026-05-23 19:25 | Estimated read 8 min

Section 01

DigitalRegistrar Project Introduction: LLM-Driven Structured Data Extraction from Pathology Reports

DigitalRegistrar is a medical AI data processing pipeline that uses large language models (LLMs) to extract structured information from pathology reports and output it as JSON. The project is maintained by kblab2024, is open source on GitHub (https://github.com/kblab2024/digitalregistrar), and was released on 2026-05-23. It targets the problems that unstructured pathology reports cause in medicine, such as difficult information retrieval and limited data analysis, by turning free text into computable structured data that can support clinical decision-making, research, and quality control.


Section 02

Pain Points in Medical Data Digitization and the Need for Structured Pathology Reports

Pain Points in Medical Data Digitization

Most pathology reports still exist in unstructured form, such as PDFs and scanned images. This makes information hard to retrieve, limits data analysis, hinders interoperability, and slows research. By one commonly cited estimate, about 80% of data in the medical industry is unstructured, and the share is reportedly even higher in pathology.

The Need for Structured Pathology Reports

Pathology reports contain key information such as basic patient information, specimen details, clinical information, pathological diagnosis, and staging/grading. Structuring these reports enables:

  • Clinical decision support (automatic alerts, treatment recommendations)
  • Accelerated clinical research (patient screening, data extraction)
  • Quality management and auditing (diagnostic accuracy monitoring)
  • Public health surveillance (tumor registration, disease burden assessment)

Section 03

DigitalRegistrar Technical Architecture: LLM-Driven Information Extraction Pipeline

The project adopts a modular pipeline architecture:

  1. Input Preprocessing: Parse PDF/DOCX/scanned images (including OCR), text cleaning, document segmentation
  2. Information Extraction Engine: Prompt engineering-based extraction (enforced JSON output), field-level strategies (simple/complex/nested fields), multi-turn dialogue extraction
  3. Model Selection: Supports GPT-4 (high accuracy), Claude (long documents), Llama 2/3 (open-source local deployment), Med-PaLM (medical specialization), etc.
  4. Post-processing and Validation: Format standardization, data validation (type/range/logic), confidence scoring
  5. Output Formatting: Follows a designed JSON Schema (including fields for patient, specimen, diagnosis, staging, biomarkers, etc.)
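
The extraction, validation, and output stages above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the field list, the prompt wording, and the names `build_prompt`, `validate`, `extract`, and `fake_model` are all assumptions, and the model call is passed in as a plain function so that any backend (hosted or local) could be plugged in.

```python
import json
from typing import Callable

# Hypothetical top-level fields and their expected JSON types.
# The project's real JSON Schema is not shown in the article.
REQUIRED_FIELDS = {
    "patient": dict,
    "specimen": dict,
    "diagnosis": str,
    "staging": dict,
    "biomarkers": list,
}

PROMPT_TEMPLATE = (
    "Extract the following fields from the pathology report and reply with "
    "a single JSON object and nothing else. Use null for anything the report "
    "does not state.\nFields: {fields}\n\nReport:\n{report}"
)


def build_prompt(report_text: str) -> str:
    """Build an extraction prompt that asks for JSON-only output."""
    return PROMPT_TEMPLATE.format(
        fields=", ".join(REQUIRED_FIELDS), report=report_text.strip()
    )


def validate(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif record[name] is not None and not isinstance(record[name], expected):
            problems.append(f"wrong type for {name}: expected {expected.__name__}")
    return problems


def extract(report_text: str, call_model: Callable[[str], str]) -> dict:
    """Run one extraction pass: prompt, parse the reply as JSON, then validate."""
    raw = call_model(build_prompt(report_text))
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"record": None, "problems": [f"invalid JSON: {exc.msg}"]}
    if not isinstance(record, dict):
        return {"record": None, "problems": ["top-level JSON value is not an object"]}
    return {"record": record, "problems": validate(record)}


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call, returning a fixed, made-up reply."""
    return json.dumps({
        "patient": {"id": None},
        "specimen": {"site": "left breast", "procedure": "core biopsy"},
        "diagnosis": "invasive ductal carcinoma",
        "staging": {},
        "biomarkers": [],
    })


result = extract("Specimen: left breast core biopsy.", fake_model)
print(result["problems"])
```

Passing the model in as a callable keeps parsing and validation testable without network access, and a reply that fails to parse or validate can be routed to a retry or to manual review.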

Section 04

Key Technical Challenges and Countermeasures

  1. Ambiguity of Medical Terminology: Build terminology dictionaries, context disambiguation, knowledge graphs
  2. Heterogeneity of Report Formats: Few-shot learning, strong model generalization, format adaptation
  3. Complex Reasoning Requirements: Chain-of-thought prompting, step-by-step extraction, integration of medical knowledge bases
  4. Data Privacy and Security: De-identification of data, local deployment, encrypted storage, and access control
  5. Model Hallucinations: JSON Schema validation, citing original text, confidence thresholds, adversarial testing
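
Two of the hallucination countermeasures above, citing original text and confidence thresholds, can be combined in a small post-extraction check. The sketch below assumes a hypothetical output shape in which the model returns, for each field, a `value`, a verbatim `evidence` quote, and a self-reported `confidence`. The article does not say that DigitalRegistrar uses this shape, and the 0.8 threshold is an arbitrary placeholder.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and ignore case so line breaks do not defeat matching."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def review_fields(report_text: str, extracted: dict, min_confidence: float = 0.8):
    """Split extracted fields into accepted values and fields flagged for review.

    A field is accepted only if its evidence quote occurs in the report
    and its confidence meets the threshold.
    """
    source = normalize(report_text)
    accepted, flagged = {}, {}
    for name, item in extracted.items():
        evidence = normalize(item.get("evidence") or "")
        confidence = item.get("confidence", 0.0)
        if not evidence or evidence not in source:
            flagged[name] = "evidence not found in report"
        elif confidence < min_confidence:
            flagged[name] = "confidence below threshold"
        else:
            accepted[name] = item["value"]
    return accepted, flagged


# Made-up report text and model output, for illustration only.
report = "Final diagnosis: Invasive ductal carcinoma,\n  grade 2. Margins negative."
extracted = {
    "diagnosis": {
        "value": "invasive ductal carcinoma",
        "evidence": "Invasive ductal carcinoma, grade 2",
        "confidence": 0.95,
    },
    "stage": {"value": "pT2", "evidence": "pT2 N0", "confidence": 0.9},
    "margins": {"value": "negative", "evidence": "Margins negative", "confidence": 0.5},
}

accepted, flagged = review_fields(report, extracted)
print(accepted)
print(flagged)
```

Here the `stage` field is flagged because its quoted evidence does not appear in the report, and `margins` is flagged for low confidence. Flagged fields would go to manual review rather than being silently dropped.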

Section 05

Application Scenarios and Case Studies: Supporting Multiple Stages of the Medical Workflow

  1. Automated Tumor Registration: Batch processing of reports and extraction of TNM staging and related fields, with a claimed speedup of more than 10x and improved data consistency
  2. Patient Screening for Clinical Research: Real-time matching of enrollment criteria, pushing eligible patients, accelerating enrollment
  3. Pathological Quality Control: Checking for missing fields, verifying logical consistency, improving report quality
  4. Data Integration for Multi-center Research: Standardized formats and coding, reducing integration costs, accelerating research
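
As an illustration of the quality control scenario, the sketch below checks a structured record for missing fields and for one simple logical inconsistency. The field names and the rule set are hypothetical and are not taken from the project. The only consistency rule shown is that the number of positive lymph nodes cannot exceed the number examined.

```python
# Hypothetical required fields for a structured pathology record.
REQUIRED = ("specimen_site", "diagnosis", "nodes_examined", "nodes_positive")


def quality_check(record: dict) -> list[str]:
    """Return a list of quality issues; an empty list means no issue was found."""
    # Missing-field check: absent keys, null values, and empty strings all count.
    issues = [f"missing: {field}" for field in REQUIRED if record.get(field) in (None, "")]

    # Logical consistency check, applied only when both counts are integers.
    examined = record.get("nodes_examined")
    positive = record.get("nodes_positive")
    if isinstance(examined, int) and isinstance(positive, int):
        if examined < 0 or positive < 0:
            issues.append("node counts must be non-negative")
        elif positive > examined:
            issues.append("nodes_positive exceeds nodes_examined")
    return issues


# Made-up records for illustration.
clean = {"specimen_site": "colon", "diagnosis": "adenocarcinoma",
         "nodes_examined": 12, "nodes_positive": 0}
suspect = {"specimen_site": "colon", "diagnosis": "",
           "nodes_examined": 3, "nodes_positive": 5}

print(quality_check(clean))
print(quality_check(suspect))
```

A real deployment would need rules written and reviewed by pathologists. The point of the sketch is only that such checks become cheap to run once the report is structured.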

Section 06

Future Development Directions and Implementation Recommendations

Future Directions

  • Multimodal fusion: combining text, images, and genomic data
  • Continuous learning: online learning, active learning, domain adaptation
  • Clinical integration: deep integration with EMR/EHR, real-time decision support
  • Global standardization: multilingual support, international coding standards, cross-border data interoperability

Implementation Recommendations

Medical institutions are advised to start with small-scale pilots and expand gradually, while establishing sound quality assurance and manual review mechanisms.


Section 07

Conclusion: The Future of AI-Enabled Medical Data Digitization

DigitalRegistrar demonstrates the potential of LLMs in medical information extraction: they can handle complex text and adapt flexibly to new report formats. The value of the technology lies in improving diagnosis and treatment, accelerating research, and enhancing quality, but adopters also need to pay attention to data quality, privacy and security, and ethical compliance. AI is an assistant to human experts, not a replacement for them, and it frees doctors to focus on their core tasks. As the technology matures rapidly, now is an opportune time to invest in intelligent medical data processing.