Practical Guide to Fine-Tuning a Medical Triage Large Model: A Complete MLOps Pipeline Based on Qwen3-1.7B

This article introduces a complete medical triage large model fine-tuning project using the Qwen3-1.7B base model. It performs supervised fine-tuning (SFT) via QLoRA, aligns with human preferences through DPO, and finally deploys as a vLLM+FastAPI inference service. The project covers the entire workflow from data pipeline, training, evaluation to CI/CD deployment.

Medical AI, Large Model Fine-Tuning, QLoRA, DPO, Qwen3, MLOps, vLLM, FastAPI, Medical Triage

Published 2026-05-26 17:14 | Recent activity 2026-05-26 17:21 | Estimated read 7 min

Section 01

Introduction

Section 02

Original Author and Source

Original Author/Maintainer: RandomFab
Source Platform: GitHub
Original Title: medical-triage-llm-finetuning
Original Link: https://github.com/RandomFab/medical-triage-llm-finetuning
Source Publication/Update Time: 2026-05-26T09:14:13Z

Section 03

Project Background and Objectives

Medical triage is a key step in hospital emergency workflows: staff must quickly assess the urgency level (immediate/medium/delayed) from the symptoms a patient describes. Traditional manual triage relies on experienced nurses, and AI-assisted triage is intended to improve efficiency when medical resources are strained. This project, initiated by Centre Hospitalier Saint-Aurélien (CHSA), aims to build an AI assistant that processes patient descriptions in both English and French and automatically classifies their urgency level. The project is released under the Apache 2.0 open-source license and demonstrates the end-to-end path from data preparation to production deployment.

Section 04

Technical Architecture Overview

The system is organized into three layers: data pipeline, training, and deployment service.

Data Layer: Integrates four public medical Q&A datasets (MediQAL MCQU, FrenchMedMCQA, MedQuAD, UltraMedical). After cleaning and anonymization, it generates 5000 SFT training samples and 5000 DPO preference alignment samples.

Training Layer: Uses Qwen3-1.7B-Base as the foundation model. First, it performs 4-bit quantized supervised fine-tuning via QLoRA (LoRA rank set to 16), then aligns with human preferences through DPO (Direct Preference Optimization). The training process uses MLflow for experiment tracking, and model weights are stored in Google Cloud Storage.

Inference Layer: The merged full model is served by vLLM, which provides continuous batching and PagedAttention, and is exposed through a FastAPI REST interface. The service is containerized and deployed on a GCP virtual machine, with CI/CD automated via GitHub Actions.
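The source does not say how the service maps free-text model output onto the three urgency levels. One possible approach, shown here as a minimal sketch, is to search the generated text for a known label and return nothing when no label is found, so the API layer can flag the case for human review. The function name `parse_urgency` and the constant `URGENCY_LEVELS` are hypothetical and not taken from the repository.

```python
import re
from typing import Optional

# The three urgency levels named in the article.
URGENCY_LEVELS = ("immediate", "medium", "delayed")


def parse_urgency(model_output: str) -> Optional[str]:
    """Return the first urgency level mentioned in the text, or None if there is none.

    Hypothetical helper: matching is case-insensitive and whole-word only.
    """
    pattern = r"\b(" + "|".join(URGENCY_LEVELS) + r")\b"
    match = re.search(pattern, model_output.lower())
    return match.group(1) if match else None


print(parse_urgency("Urgency: IMMEDIATE. Chest pain radiating to the left arm."))
print(parse_urgency("The patient reports mild seasonal allergies."))
```

Returning `None` rather than a default label keeps an unparseable answer from being silently treated as low urgency.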

Section 05

Efficient Fine-Tuning with QLoRA

QLoRA (Quantized Low-Rank Adaptation) is one of the core technologies of this project. The base model is quantized to 4-bit NormalFloat (NF4) and kept frozen, and small low-rank adapters are trained on top of it, so training can be completed on a single GPU with about 16 GB of VRAM or more (for example an NVIDIA T4 or L4). Compared to full-parameter fine-tuning, QLoRA is reported to reduce VRAM usage by approximately 75% while maintaining good fine-tuning results.
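The arithmetic behind these savings can be sketched in a few lines. The example below compares weight storage at 16-bit and 4-bit precision and counts the parameters that a rank-16 adapter adds to one weight matrix. It covers weight storage only and ignores quantization constants, activations, and optimizer state, so it is a back-of-the-envelope illustration, not a reproduction of the project's measurements. The layer width of 2048 is an assumption chosen for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed to store n_params weights at the given precision, in GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns an update B @ A, with A of shape (rank, d_in) and B of shape (d_out, rank).
    """
    return rank * (d_in + d_out)


N_PARAMS = 1.7e9  # nominal size of Qwen3-1.7B
HIDDEN = 2048     # assumed layer width, for illustration only
RANK = 16         # LoRA rank stated in the article

fp16_gb = weight_memory_gb(N_PARAMS, 16)
nf4_gb = weight_memory_gb(N_PARAMS, 4)
saving = 1 - nf4_gb / fp16_gb

full = HIDDEN * HIDDEN
adapter = lora_param_count(HIDDEN, HIDDEN, RANK)

print(f"16-bit weights: {fp16_gb:.2f} GB, 4-bit weights: {nf4_gb:.2f} GB")
print(f"weight-storage saving: {saving:.0%}")
print(f"LoRA adds {adapter} params to a {full}-param matrix ({adapter / full:.2%})")
```

Under these assumptions the frozen weights shrink by the 4/16 ratio, and the adapter for a square 2048-wide matrix is under 2% of that matrix's size, which is why only a small fraction of parameters needs gradients and optimizer state.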

Section 06

DPO Preference Alignment

Traditional RLHF (Reinforcement Learning from Human Feedback) requires training a separate reward model and then running reinforcement learning against it, which makes the pipeline complex. DPO instead optimizes the model directly on preference pairs with a classification-style loss, which greatly simplifies the implementation. The DPO data in the project uses the triplet format (question, preferred answer, non-preferred answer) from the UltraMedical-Preference dataset.

Section 07

DVC Data Version Control

Medical data involves privacy and compliance requirements, so the project uses DVC (Data Version Control) to manage the data pipeline. From raw data download to final training set generation, six processing stages are defined (among them clean → anonymize → tokenize → split). When a tracked parameter changes, reproducing the pipeline re-runs the affected stage and the stages that depend on it.
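The re-execution behavior can be illustrated with a toy model. The sketch below is not DVC's implementation. It only mimics the idea that each stage is fingerprinted from its parameters plus the fingerprint of the stage before it, so a change propagates downstream. The stage names come from the article, while the parameter names and values are hypothetical.

```python
import hashlib
import json


def fingerprint(params: dict, upstream: str) -> str:
    """Hash a stage's parameters together with the fingerprint of the stage before it."""
    payload = json.dumps({"params": params, "upstream": upstream}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def stages_to_rerun(pipeline, cache):
    """Return the stages whose fingerprint changed, updating the cache in place.

    pipeline: ordered list of (stage_name, params) pairs, each feeding the next.
    cache: dict mapping stage_name to the fingerprint recorded on the last run.
    """
    stale = []
    upstream = ""
    for name, params in pipeline:
        current = fingerprint(params, upstream)
        if cache.get(name) != current:
            stale.append(name)
            cache[name] = current
        upstream = current
    return stale


# Stage names come from the article; every parameter here is hypothetical.
pipeline = [
    ("clean", {"min_chars": 20}),
    ("anonymize", {"mask_names": True}),
    ("tokenize", {"max_length": 512}),
    ("split", {"val_ratio": 0.1}),
]
cache = {}
print(stages_to_rerun(pipeline, cache))  # first run: every stage is stale
print(stages_to_rerun(pipeline, cache))  # nothing changed: nothing to re-run

pipeline[2] = ("tokenize", {"max_length": 1024})
print(stages_to_rerun(pipeline, cache))  # tokenize and the downstream split stage
```

In a real DVC project the same bookkeeping is declared in `dvc.yaml`, recorded in `dvc.lock`, and applied by `dvc repro`.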

Section 08

Training Results and Evaluation

In the SFT phase, the training loss fell to 1.112 and the validation loss to 1.189. The small gap between the two suggests the model converged without pronounced overfitting. The project includes 70 unit tests covering the data pipeline, API interfaces, and model inference logic. The project is currently in the 4th week of the deployment phase: the API service is ready, and final production environment verification is underway. The DPO-aligned model and the complete technical report are being developed in parallel.
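One way to read these loss values is to convert them to perplexity. The sketch below assumes the reported losses are mean per-token cross-entropy in nats, which is what common causal-LM trainers log but which the source does not state explicitly.

```python
import math


def perplexity(mean_nll: float) -> float:
    """Perplexity for a mean per-token negative log-likelihood given in nats."""
    return math.exp(mean_nll)


train_loss, val_loss = 1.112, 1.189  # values reported in the article
gap = val_loss - train_loss

print(f"train perplexity: {perplexity(train_loss):.2f}")
print(f"validation perplexity: {perplexity(val_loss):.2f}")
print(f"loss gap: {gap:.3f}")
```

Under that assumption the two perplexities come out at roughly 3.0 and 3.3, a modest difference that fits the reading of a small generalization gap.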
