The system uses a layered architecture with three layers, covering the data pipeline, the training process, and the deployment service:
Data Layer: Integrates four public medical Q&A datasets (MediQAL MCQU, FrenchMedMCQA, MedQuAD, UltraMedical). After cleaning and anonymization, the pipeline yields 5,000 supervised fine-tuning (SFT) samples and 5,000 DPO preference-alignment samples.
Training Layer: Uses Qwen3-1.7B-Base as the foundation model. The model is first fine-tuned with supervision via QLoRA (4-bit quantized base weights, LoRA rank 16), then aligned with human preferences through DPO (Direct Preference Optimization). Experiments are tracked with MLflow, and model weights are stored in Google Cloud Storage.
Inference Layer: The merged full model is served with vLLM, which provides continuous batching and PagedAttention, and is exposed as a REST interface through FastAPI. The service is containerized and deployed on a GCP virtual machine, with CI/CD automated via GitHub Actions.
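
To make the two training stages in the Training Layer more concrete, here is a minimal Python sketch, not the project's actual training code. It shows how many trainable parameters a rank-16 LoRA adapter adds to a single weight matrix, and how the standard DPO loss is computed for one preference pair. The function names, the 2048 x 2048 matrix size used in the note below, and the default beta = 0.1 are illustrative assumptions; none of them come from the project itself.

```python
import math


def lora_param_count(d_in: int, d_out: int, rank: int = 16) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix.

    LoRA learns two low-rank factors: A with shape (rank, d_in) and
    B with shape (d_out, rank). The frozen base weight is not counted.
    """
    return rank * (d_in + d_out)


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # hypothetical value; the real setting is not stated
) -> float:
    """DPO loss for a single (chosen, rejected) preference pair.

    Inputs are summed log-probabilities of each response under the policy
    being trained and under the frozen reference model.
    """
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    logits = beta * (policy_margin - ref_margin)
    # loss = -log(sigmoid(logits)), written in a numerically stable form
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

For a hypothetical 2048 x 2048 projection, `lora_param_count(2048, 2048)` gives 65,536 trainable parameters, about 1.6% of the 4,194,304 in the full matrix. When the policy and the reference model agree on both responses, `dpo_loss` equals log 2, and it falls below that as the policy prefers the chosen response more strongly than the reference does.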