Reading

Practical Guide to Fine-Tuning Local Large Language Models: Detailed Explanation of LoRA and DoRA Parameter-Efficient Fine-Tuning Techniques

A complete guide to local LLM fine-tuning using the Qwen3-4B model, enabling LoRA and DoRA adapter training on consumer CPUs without a GPU, and exporting to GGUF format compatible with Ollama.

LoRADoRA参数高效微调Qwen3本地部署Ollama大语言模型PEFTCPU训练GGUF

Published 2026-06-03 04:44Recent activity 2026-06-03 04:48Estimated read 7 min

Practical Guide to Fine-Tuning Local Large Language Models: Detailed Explanation of LoRA and DoRA Parameter-Efficient Fine-Tuning Techniques

Section 01

【Introduction】Practical Fine-Tuning of Qwen3-4B on Local CPU: Detailed Explanation of LoRA and DoRA Techniques

This project, published by Hassan Butt on GitHub (link: https://github.com/Hassan-Butt4356/llm-finetuning-lora-dora), provides a complete guide to parameter-efficient fine-tuning of the Qwen3-4B-Instruct model using LoRA/DoRA on consumer CPUs without a GPU. It also supports exporting to GGUF format compatible with Ollama, significantly lowering the technical barrier for individual developers and small teams to fine-tune large models locally.

Section 02

Background: The Necessity of Parameter-Efficient Fine-Tuning

Traditional full fine-tuning of large models requires expensive GPU clusters, with extremely high demands on memory and storage, which hinders local fine-tuning by individuals and small teams. Parameter-Efficient Fine-Tuning (PEFT) techniques only train a small number of additional parameters, reducing resource requirements while maintaining performance. This project provides a local solution based on the PEFT approach.

Section 03

Comparison of LoRA and DoRA Technical Principles

LoRA Principle

Core idea: Weight changes during model fine-tuning have low intrinsic rank. Introduce low-rank matrices A (random Gaussian initialization) and B (zero initialization), output h=Wx+BAx, and only update A and B during training. Advantages: High parameter efficiency (saves ~98% parameters when rank r=8), modularity, composability.

DoRA Improvements

Decompose pre-trained weights into magnitude and direction components, and apply low-rank updates only to the direction component.

Comparison

Feature	LoRA	DoRA
Training Speed	Baseline	10-15% slower
Small Dataset Quality	Good	Better
Large Dataset Quality	Very Good	Very Good
Memory Usage	Less	Slightly More

The project provides train_lora.py and train_dora.py scripts for selection.

Section 04

Complete Workflow and Hardware Requirements

Hardware Requirements

CPU: Intel Core Ultra7 255H or equivalent
Memory: 32GB (minimum 16GB)
Storage: ~30GB free space
System: Windows10/11 or Linux
Python3.10+ and Ollama installed

7-Step Workflow

Download Qwen3-4B-Instruct base model (~8GB)
Put PDFs into the my_pdfs/ directory; the script automatically extracts text to generate JSONL data (requires Ollama to run) 3-4. Run LoRA/DoRA training script (~20-40 minutes for 50 samples, ~5-10 hours for 500 samples)
Merge adapter with base model (generate ~16GB complete model)
Convert to GGUF format using llama.cpp
Test the model to verify results

Section 05

Interpretation of Key Configuration Parameters

Adjustable parameters in config.py:

Parameter	Default Value	Description
LORA_RANK	8	Higher rank means larger capacity but slower training
EPOCHS	1	CPU training recommended to start with 1 epoch
MAX_SEQ_LEN	512	Reducing it speeds up CPU training
LEARNING_RATE	2e-4	Standard learning rate for LoRA
MAX_CHUNKS_PER_PDF	30	Increasing gives more training data

Section 06

Practical Application Value of the Project

Data Privacy: All processing is done locally; sensitive documents do not need to be uploaded to the cloud
Cost-Effectiveness: No need to rent GPU servers; existing hardware can customize models
Fast Iteration: Adapters are only ~50MB, easy for version management and deployment
Scalability: Modular design is easy to integrate into MLOps workflows

Section 07

Summary of Technical Points and Practical Recommendations

Key points to note for project reproduction:

Memory Management: 32GB is a comfortable configuration; 16GB requires adjusting MAX_SEQ_LEN and BATCH_SIZE
Data Quality: Text extraction quality from PDFs affects results; preprocessing to clean noise is recommended
Iteration Strategy: First validate the process with small datasets, then expand to full data
Training Monitoring: Observe loss curves and adjust learning rate timely

This project provides a blueprint for local LLM fine-tuning for individuals and small-to-medium teams, demonstrating the value of PEFT techniques in resource-constrained environments.