
PVminerLLM: Extracting Structured Patient Voices from Patient-Generated Text Using Large Language Models

This article introduces the PVminerLLM framework, an innovative system that uses large language models to automatically extract structured patient voice signals from unstructured patient-generated text, providing a new technical approach for patient feedback analysis in the medical field.

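Tags: patient voice, large language models, medical NLP, information extraction, LoRA fine-tuning, PEFT, electronic medical records, patient feedback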

Published 2026-06-12 05:38 | Recent activity 2026-06-12 05:49 | Estimated read 8 min


Section 01

PVminerLLM: Guide to Extracting Structured Patient Voices Using Large Language Models

Core Views

PVminerLLM is a framework that uses large language models to automatically extract structured patient voice signals from unstructured patient-generated text. It targets information that fixed-format questionnaires tend to miss and offers a new approach to patient feedback analysis in healthcare. The project is open source and provides pre-trained models at several scales for different deployment scenarios.

Project Basic Information

Original author/maintainer: SarielMa
Source platform: GitHub
Release time: June 11, 2026
Original link: https://github.com/SarielMa/PVminerLLM

Section 02

Research Background and Core Concepts of Patient Voices

Research Background

Traditional patient feedback relies on structured questionnaires, which struggle to capture patients' authentic, individual expressions. The growth of internet-based healthcare has produced a rapidly expanding volume of unstructured patient text, yet extracting structured patient voices from it remains an open challenge in medical NLP. PVminerLLM was developed to address this gap.

Core Dimensions of Patient Voices

Patient Concerns: Health issues, treatment doubts, prognosis anxiety, etc., to facilitate doctor-patient communication.
Treatment Experience: Drug side effects, medical process, healthcare provider attitude, etc., to guide service improvement.
Contextual Signals: Emotional state, health literacy, social support, etc., to help fully understand patient expressions.
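The three dimensions above map naturally onto a structured record. The following is a minimal sketch of what such a record could look like; the class name `PatientVoiceRecord` and all of its field names are hypothetical illustrations, not the schema PVminerLLM actually uses.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class PatientVoiceRecord:
    """Hypothetical container for the three patient voice dimensions."""
    text: str                                              # original patient-generated text
    concerns: List[str] = field(default_factory=list)      # e.g. treatment doubts
    experiences: List[str] = field(default_factory=list)   # e.g. drug side effects
    context: Dict[str, str] = field(default_factory=dict)  # e.g. emotional state

    def to_json(self) -> str:
        # Serialize the record so downstream tools can consume it.
        return json.dumps(asdict(self), ensure_ascii=False)


# Illustrative example; the text and labels are made up for this sketch.
record = PatientVoiceRecord(
    text="The new pills make me dizzy and I'm scared they won't work.",
    concerns=["doubt about treatment effectiveness"],
    experiences=["dizziness after starting medication"],
    context={"emotional_state": "anxious"},
)
print(record.to_json())
```

Keeping the source text next to the extracted fields makes it easier for a reviewer to check each extraction against what the patient actually wrote.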

Section 03

Technical Architecture and Implementation Methods

Three-Stage Pipeline Architecture

Supervised Fine-Tuning (SFT): Uses LoRA/QLoRA via the PEFT library, which freezes the original model weights and trains only small low-rank adapter matrices. Advantages: high parameter efficiency, lower overfitting risk, and easy deployment; multi-GPU distributed training is supported.
Model Merging: Merges the trained LoRA adapter back into the base model to produce a standalone extraction model.
Evaluation with the FinBen framework: Measures extraction accuracy, reports performance for each signal type, and supports fine-grained error analysis.
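The first two stages rest on simple arithmetic: LoRA leaves a base weight matrix W0 frozen and learns a low-rank update (alpha / r) * B * A, and merging adds that update into W0 once. The toy sketch below illustrates the idea in pure Python on tiny matrices; it is not the project's merge_lora.py, and all values and helper names are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]


# Frozen base weight W0 (2x3); never updated during fine-tuning.
W0 = [[1.0, 0.0, 2.0],
      [0.0, 1.0, -1.0]]

# Rank-1 adapter: B is 2x1, A is 1x3. Only these would be trained.
r, alpha = 1, 2
B = [[0.5], [-0.5]]
A = [[1.0, 2.0, 0.0]]

# Merging: fold the scaled low-rank update into the base weight once.
W_merged = add(W0, scale(matmul(B, A), alpha / r))

x = [[1.0], [1.0], [1.0]]  # input as a column vector

# Unmerged inference runs the base path and the adapter path separately.
y_adapter = add(matmul(W0, x), scale(matmul(B, matmul(A, x)), alpha / r))
# Merged inference needs a single matrix multiply.
y_merged = matmul(W_merged, x)

assert y_adapter == y_merged
print(W_merged, y_merged)
```

Because the merged matrix gives the same output as the two-path computation, the merged model can be served like any ordinary checkpoint without carrying the adapter separately.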

Code and Usage

Training scripts: sft_peft_ddp.py (distributed training), merge_lora.py (adapter merging), etc.
Environment setup: run conda env create -f environment.yml to create the environment, then conda activate finben_vllm3 to activate it.
Training command: torchrun --nproc_per_node=2 sft_peft_ddp.py (specify model, dataset path, and other parameters).

Section 04

Pre-trained Models and Application Scenarios

Pre-trained Models (Released on Hugging Face)

voice_70b_llama3.3_instruct (high-precision offline tasks)
voice_8b_llama3.1_instruct
voice_3b_llama3.2_instruct
voice_qwen2.5_1.5b_instruct (real-time applications)

Application Scenarios

Online Patient Community Analysis: Extract patient concerns and experiences from forums/social media.
Electronic Health Record (EHR) Information Extraction: Structured processing of chief complaints and medical history in EHRs to support clinical decision-making.
Satisfaction Survey Enhancement: Analyze open-ended feedback to identify issues not covered by preset options.
Adverse Drug Reaction Monitoring: Identify spontaneously reported side effects from patients.

Section 05

Technical Contributions and Limitations

Technical Contributions

Domain-specific fine-tuning strategy: Designed data construction, prompts, and evaluation metrics for patient voice extraction.
Multi-model scale coverage: 1.5B to 70B parameters, adapting to different computing resource requirements.
Open-source and reproducible: Complete code and pre-trained models are open-source, supporting follow-up research.
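The evaluation metrics mentioned in the first contribution are not detailed here. As a hedged illustration only, a common way to score extraction is set-based precision, recall, and F1 computed separately for each signal dimension. The function `prf1`, the dimension keys, and the labels below are assumptions for this sketch, not the project's or FinBen's actual metric code.

```python
def prf1(predicted, gold):
    """Set-based precision, recall, and F1 for one list of extracted labels."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)  # labels that appear in both sets
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Made-up gold annotations and model predictions for one patient post.
gold = {
    "concerns": ["prognosis anxiety", "treatment doubt"],
    "experiences": ["drug side effect"],
}
pred = {
    "concerns": ["treatment doubt"],
    "experiences": ["drug side effect", "long wait time"],
}

# Score each dimension on its own so weak spots are visible.
scores = {dim: prf1(pred.get(dim, []), gold[dim]) for dim in gold}
for dim, (p, r, f) in scores.items():
    print(f"{dim}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Scoring each dimension separately shows whether a model misses concerns (low recall) or over-generates experiences (low precision), which a single pooled score would hide.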

Limitations

Data privacy: Strict desensitization and privacy protection measures are required.
Cross-language adaptability: Currently focused on English; needs adaptation to multi-language and cultural contexts.
Clinical validation: Extracted information requires verification by clinical experts for accuracy and relevance.

Section 06

Conclusion and Future Outlook

Conclusion

PVminerLLM applies large language models to a concrete healthcare need: extracting structured information from large volumes of patient text. In doing so, it offers a practical route toward patient-centered care.

Future Directions

Strengthen data privacy and ethical protection.
Improve cross-language and cross-cultural adaptability.
Conduct clinical validation to ensure the clinical value of extracted information.

This open-source project offers new ideas and tools for the digital transformation of healthcare and could help improve care quality and resource allocation.
