Building Mini-LLaVA from Scratch: A Record of Iterative Development for a Vision-Language Model

This is an educational project for building a Vision-Language Model (VLM) from scratch. The author trained a runnable Mini-LLaVA model on an RTX 4060 laptop using a combination of CLIP-ViT and Qwen2.5. The project details the iterative process from v1 to v2, including architecture design, training strategies, and problem-solving ideas, making it an excellent reference for learning multimodal model development.

Tags: Vision-Language Model · VLM · LLaVA · Multimodal AI · CLIP · Qwen · LoRA · Instruction Fine-tuning · Education · Open Source
Published 2026-05-13 14:27 · Recent activity 2026-05-13 14:52 · Estimated read: 7 min

Section 01

Introduction: The Core of an Educational Project to Build Mini-LLaVA from Scratch

This is an educational open-source project where the author builds the Mini-LLaVA vision-language model from scratch, completing training on an RTX 4060 laptop GPU using a combination of CLIP-ViT and Qwen2.5. The project details the iterative process from v1 to v2, including architecture design, training strategies, and problem-solving ideas, providing a clear path and reference for learning multimodal model development.


Section 02

Project Background and Learning Value

In the VLM field, LLaVA is a landmark open-source project, but it is hard to grasp its internal mechanisms by using the existing codebase directly. To address this learning pain point, the project uses a simplified Mini-LLaVA implementation as a vehicle and fully documents the development cycle of "identify problem → analyze cause → iterate and improve". By focusing on the process rather than raw performance, it becomes a distinctive resource for learning multimodal development.


Section 03

Technical Architecture and Model Selection

The project adopts an architecture similar to LLaVA-1.5, streamlined and optimized for consumer hardware:

  • Vision Encoder: CLIP-ViT-B/32, which offers strong general visual representations at a parameter count suited to consumer-grade hardware; a 224×224 input split into 32×32 patches yields a 7×7 grid, so it outputs 49 feature vectors of 768 dimensions each.
  • Language Model: Qwen2.5-0.5B-Instruct, which runs within 8 GB of VRAM and follows instructions well; its embedding dimension of 896 requires alignment via a projection layer.
  • Projection Layer: A learnable MLP that maps CLIP's 768-dimensional features into Qwen's 896-dimensional space; it is the only component trained in the first stage (see the sketch after this list).
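To make the dimension alignment concrete, here is a minimal PyTorch sketch of such a projector. The post specifies only a learnable MLP mapping 768 to 896 dimensions; the two-layer structure, GELU activation, and hidden width below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps CLIP-ViT-B/32 patch features (49 x 768) into the
    Qwen2.5-0.5B embedding space (896-d). Depth, activation, and
    hidden width are illustrative assumptions, not the post's exact spec."""
    def __init__(self, clip_dim: int = 768, llm_dim: int = 896, hidden_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(clip_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, 49, 768) -- a 224x224 image split into
        # 32x32 patches gives a 7x7 = 49 token grid
        return self.net(patch_feats)  # (batch, 49, 896)

projector = VisionProjector()
print(projector(torch.randn(2, 49, 768)).shape)  # torch.Size([2, 49, 896])
```

Because only these few parameters receive gradients in the first stage, the layer can be trained quickly even on a laptop GPU.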

Section 04

Detailed Two-Stage Training Strategy

The project follows a two-stage training paradigm:

  1. Projection Pre-training: Freeze CLIP and Qwen and train only the MLP projection layer on 5000 image-text pairs from Flickr30k with a learning rate of 1e-3. One epoch takes about 7 minutes and reaches a loss of 2.4403. The goal is to align the visual and language embedding spaces.
  2. Instruction Fine-tuning: Fine-tune Qwen's attention layers with LoRA while continuing to train the projection layer alongside the LoRA parameters; mix localized_narratives (long descriptions), aokvqa (reasoning QA), and vqav2 (factual QA) in equal 33% shares; apply response-only label masking, so that only the assistant's answer contributes to the loss, forcing the model to learn to answer questions rather than imitate caption patterns (see the sketch after this list).
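The two stages differ mainly in which parameters receive gradients and how the labels are built. The sketch below, in PyTorch with the Hugging Face peft library, illustrates both ideas; the LoRA rank and alpha, the target module names, and the helper names (setup_stage1, setup_stage2, build_labels, and the clip/qwen/projector modules) are assumptions for illustration, not the post's exact configuration.

```python
import torch
from peft import LoraConfig, get_peft_model  # Hugging Face PEFT

IGNORE_INDEX = -100  # labels set to -100 are ignored by CrossEntropyLoss

# --- Stage 1: train only the projector ---------------------------------
def setup_stage1(clip, qwen, projector):
    for p in clip.parameters():
        p.requires_grad = False   # freeze vision encoder
    for p in qwen.parameters():
        p.requires_grad = False   # freeze language model
    return torch.optim.AdamW(projector.parameters(), lr=1e-3)

# --- Stage 2: LoRA on attention + response-only label masking ----------
def setup_stage2(qwen):
    cfg = LoraConfig(
        r=8, lora_alpha=16,                      # illustrative values
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    # Only the LoRA adapters become trainable here; per the post,
    # the projection layer also stays trainable in stage 2.
    return get_peft_model(qwen, cfg)

def build_labels(prompt_ids: list[int], answer_ids: list[int]):
    """Mask everything before the assistant's answer so that only
    the response tokens contribute to the loss."""
    input_ids = torch.tensor(prompt_ids + answer_ids)
    labels = torch.tensor([IGNORE_INDEX] * len(prompt_ids) + answer_ids)
    return input_ids, labels
```

Setting a label to -100 is the standard way to exclude a token from PyTorch's cross-entropy loss, so the prompt and image tokens shape the context but never the gradient.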

Section 05

Iterative Improvements from v1 to v2 and Effect Verification

The v1 model had a pattern-imitation problem: it tended to generate descriptions resembling Flickr30k captions instead of directly answering questions. The root cause was that the first-stage training objective was language modeling (caption generation) rather than instruction following. v2 improved significantly after instruction fine-tuning: it adjusts the answer format to the question type (open-ended → concise description, attribute → the attribute's value, yes/no → confirmation), and visual QA accuracy rose from 0/1 to 4/5 (80%).


Section 06

Challenges Discovered in Multilingual and Out-of-Distribution (OOD) Testing

  • Multilingual Challenge: The training data is 100% English, and LoRA fine-tuning degraded the model's Korean ability (e.g., when asked what the dog is wearing on its head, it answered "dog" instead of "hat"), underscoring the importance of balanced multilingual data during parameter-efficient fine-tuning (PEFT).
  • OOD Testing: Shown a Pikachu image, the model misclassified it as a giraffe, a systematic failure mode of mapping unfamiliar inputs to the nearest category in the training distribution; an OOD detection mechanism is needed.

Section 07

Summary and Future Improvement Directions

Summary: Although this project is not the most performant VLM, its detailed iteration records and problem analyses give learners a clear path to understanding multimodal models. Limitations: impaired multilingual ability, limited OOD handling, single-image input only, and no mechanism for resuming interrupted training. Future Directions: v3 plans to introduce Korean training data, upgrade to CLIP-ViT-L/14, and add an OOD detection module, among other improvements.