Zing Forum

Using Large Language Models to Extract Structured Features from Pathological Reports: A Study on IgA Nephropathy Subtyping

This project demonstrates a complete workflow that uses a large language model (DeepSeek) to automatically extract structured features from unstructured pathological reports, then applies cluster analysis to define clinically actionable subtypes of IgA nephropathy. The workflow covers feature extraction, cleaning, embedding, clustering, and interpretability analysis.

LLM, pathology, IgA nephropathy, feature extraction, clustering, SHAP, DeepSeek, computational pathology, precision medicine
Published 2026-03-29 03:14 | Recent activity 2026-03-29 03:23 | Estimated read 8 min

Section 01

[Introduction] Core Overview of LLM-Driven Pathological Subtyping Research for IgA Nephropathy

This project presents a complete workflow that uses a large language model (DeepSeek) to automatically extract structured features from unstructured IgA nephropathy pathological reports and defines clinically actionable subtypes through cluster analysis. The core process covers feature extraction, cleaning, embedding, clustering, and interpretability analysis. It addresses the difficulty of using traditional free-text pathological reports directly for data analysis and offers a new path toward precision medicine. Project code is available at zhji0426/LLM-for-pathological-subtypes.

Section 02

Research Background and Significance: Challenges of Pathological Reports and Opportunities for LLMs

Challenges of Pathological Reports: Traditional reports are unstructured (inconsistent formats, scattered information), vary in terminology (different doctors describe the same lesion in different terms), and are time-consuming and error-prone to extract by hand, which makes manual extraction hard to scale.

Opportunities for LLMs: LLMs have strong natural language understanding. They can comprehend medical terminology, extract structured information, handle term variations, and apply the same extraction criteria consistently across reports.

Section 03

Technical Workflow: End-to-End Steps for Pathological Subtyping Analysis

Complete Workflow: Pathological Report Text → Structured JSON Features → Cleaning & Standardization → Chunked Embedding → Cluster Analysis → Stability Validation → Interpretability Analysis

Key Points of Each Stage:

  1. LLM Feature Extraction: Script 01_getFeature_integrated.py uses the DeepSeek API to extract four categories of features (glomerular lesions, tubulointerstitial lesions, vascular lesions, immunofluorescence) and constrains the output via prompt templates and JSON Schema.
  2. Data Cleaning: Script 02_clean_pathology_feature.py maps the extracted features to standardized templates, filters invalid keys, and unifies value formats.
  3. Chunked Embedding: Script embed_ollama_03.py embeds each of the four categories separately with a local Ollama model, then concatenates the vectors to preserve structural information.
  4. Cluster Analysis: Script 04_robust_clustering_evaluator.py applies two-step PCA dimensionality reduction, then compares multiple algorithms such as K-means and hierarchical clustering.
  5. Stability Validation: Script 07_stable_classification_analysis.py tests the reliability of the results via subsampling consistency, perturbation stability, and cross-algorithm consistency.
  6. Interpretability: Script 05_interpretability_pipeline.py uses SHAP to identify key features and verifies their impact via counterfactual experiments.
  7. Visualization: Script 06_ncomms_integrated.py generates publication-quality figures (SHAP distribution, cluster stability heatmap, etc.).
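
The hand-off from stage 1 to stage 2 can be sketched as follows, assuming the LLM returns one JSON object keyed by the four categories. The template, the alias tables, the feature names, and the function clean_features are all hypothetical illustrations and are not taken from 02_clean_pathology_feature.py.

```python
import json

# Hypothetical standardized template: category -> allowed feature keys.
TEMPLATE = {
    "glomerular_lesions": {"mesangial_hypercellularity", "crescents"},
    "tubulointerstitial_lesions": {"tubular_atrophy", "interstitial_fibrosis"},
    "vascular_lesions": {"arteriolar_hyalinosis"},
    "immunofluorescence": {"iga", "c3"},
}

# Hypothetical alias tables used to unify key spellings and value formats.
KEY_ALIASES = {
    "crescent": "crescents",
    "mesangial_proliferation": "mesangial_hypercellularity",
    "iga_deposition": "iga",
}
VALUE_ALIASES = {
    "yes": "present", "positive": "present",
    "no": "absent", "negative": "absent",
}


def clean_features(raw_json: str) -> dict:
    """Map raw LLM output onto the template, dropping unknown keys."""
    raw = json.loads(raw_json)
    cleaned = {}
    for category, allowed in TEMPLATE.items():
        section = raw.get(category)
        if not isinstance(section, dict):
            section = {}  # a missing or malformed category becomes empty
        out = {}
        for key, value in section.items():
            norm_key = key.strip().lower().replace(" ", "_")
            norm_key = KEY_ALIASES.get(norm_key, norm_key)
            if norm_key not in allowed:
                continue  # filter keys that are not in the template
            text = str(value).strip().lower()
            out[norm_key] = VALUE_ALIASES.get(text, text)
        cleaned[category] = out
    return cleaned


raw = json.dumps({
    "glomerular_lesions": {"Crescent": "Yes", "made_up_key": 1},
    "immunofluorescence": {"IgA deposition": "3+"},
})
cleaned = clean_features(raw)
print(cleaned["glomerular_lesions"])  # {'crescents': 'present'}
```

Keeping every category in the output, even when empty, gives each report the same shape, which simplifies the per-category embedding that follows.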

Section 04

Technical Highlights: Four Innovations Driving Research Breakthroughs

  1. LLM-Driven Information Extraction: Works zero-shot without fine-tuning, understands medical semantics, and constrains the output format via JSON Schema.
  2. Chunked Embedding Strategy: Embeds each pathological category separately to avoid information dilution and make later interpretation easier.
  3. End-to-End Reproducible Workflow: Each stage is automated, has clearly defined inputs and outputs, and is covered by validation mechanisms.
  4. Stability-First Clustering: Validates cluster results along multiple dimensions to check that the subtypes reflect reproducible signal rather than noise.
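
One of the stability checks named above, subsampling consistency, can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the logic of 07_stable_classification_analysis.py: the tiny k-means uses a deterministic farthest-point initialization as a simplification, agreement is measured with the adjusted Rand index (ARI), and the function names, the 80% subsample fraction, and the toy data are all hypothetical.

```python
import random
from collections import Counter
from math import comb


def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=50):
    """Tiny k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def adjusted_rand_index(a, b):
    """ARI between two labelings; invariant to how clusters are numbered."""
    total = comb(len(a), 2)
    if total == 0:
        return 1.0
    sum_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / total
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


def subsample_stability(points, k, rounds=20, frac=0.8, seed=0):
    """Mean ARI between full-data labels and labels from random subsamples."""
    rng = random.Random(seed)
    reference = kmeans(points, k)
    scores = []
    for _ in range(rounds):
        idx = rng.sample(range(len(points)), int(frac * len(points)))
        sub_labels = kmeans([points[i] for i in idx], k)
        scores.append(adjusted_rand_index([reference[i] for i in idx], sub_labels))
    return sum(scores) / len(scores)


# Toy data: two well-separated groups standing in for reduced embeddings.
blob_a = [(0.1 * i, 0.1 * (i % 3)) for i in range(10)]
blob_b = [(10 + 0.1 * i, 10 + 0.1 * (i % 3)) for i in range(10)]
score = subsample_stability(blob_a + blob_b, k=2)
print(round(score, 3))
```

A mean ARI near 1 means the partition survives resampling. Values closer to 0 suggest the clusters depend on which reports happened to be included.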

Section 05

Application Scenarios and Extensions: From IgA Nephropathy to Multi-Domain Migration

Direct Applications: IgA nephropathy cohort studies, prognosis prediction, clinical trial stratification.

Extending the Method: Other glomerular diseases (membranous nephropathy, FSGS), tumor pathology (feature extraction for molecular subtyping), radiology reports (structured finding extraction).

Swappable Components: The LLM (e.g., GPT-4 or Claude), the embedding model (e.g., BioBERT), and the clustering algorithm (e.g., deep clustering) can each be replaced.
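
The chunked embedding step is a convenient place to illustrate how a component can be swapped, because the embedding model can be passed in as a plain callable. The sketch below is an assumption-laden illustration, not the code of embed_ollama_03.py: toy_embedder is a hash-based stand-in for a real model (an Ollama or BioBERT wrapper would take its place), and the category keys, feature names, and function names are hypothetical.

```python
import hashlib
import json
from typing import Callable, Sequence

# Fixed category order, so each slice of the final vector has a known meaning.
CATEGORIES = (
    "glomerular_lesions",
    "tubulointerstitial_lesions",
    "vascular_lesions",
    "immunofluorescence",
)

Embedder = Callable[[str], Sequence[float]]


def toy_embedder(text: str, dim: int = 8) -> list:
    """Stand-in for a real embedding model: hashes text to a fixed-size vector."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def embed_report(features: dict, embed: Embedder) -> list:
    """Embed each category separately, then concatenate in a fixed order."""
    vector = []
    for category in CATEGORIES:
        chunk_text = json.dumps(features.get(category, {}), sort_keys=True)
        vector.extend(embed(f"{category}: {chunk_text}"))
    return vector


features = {
    "glomerular_lesions": {"crescents": "present"},
    "tubulointerstitial_lesions": {"tubular_atrophy": "mild"},
    "vascular_lesions": {"arteriolar_hyalinosis": "absent"},
    "immunofluorescence": {"iga": "3+"},
}
changed = {**features, "vascular_lesions": {"arteriolar_hyalinosis": "present"}}

v1 = embed_report(features, toy_embedder)
v2 = embed_report(changed, toy_embedder)
print(len(v1))  # 4 categories x 8 dimensions = 32
```

Because each category occupies its own slice of the vector, changing only the vascular findings changes only the third slice, which is what makes per-category interpretation possible later on.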

Section 06

Limitations and Challenges: Key Issues to Address

Data Privacy: Strict de-identification, local LLM deployment, and compliance with regulations such as HIPAA and GDPR are required.

LLM Limitations: Risk of hallucinations, terminology ambiguity issues, and high cost for large-scale calls.

Validation Needs: The discovered subtypes need prospective clinical validation (correlation with prognosis, prediction of treatment response, reproducibility by pathologists).

Section 07

Summary and Outlook: The Intelligent Direction of Computational Pathology

This project combines LLMs and machine learning to unlock the value of unstructured pathological reports. It reflects where computational pathology is heading: automation (less manual work), standardization (unified extraction standards), scale (processing large cohorts), and intelligence (LLM understanding plus ML analysis). Future work could jointly analyze pathological images and text to improve subtyping accuracy and provide a technical reference for precision medicine.