Building Mini-LLaVA from Scratch: A Record of Iterative Development for a Vision-Language Model

This is an educational project for building a Vision-Language Model (VLM) from scratch. The author trained a runnable Mini-LLaVA model on an RTX 4060 laptop using a combination of CLIP-ViT and Qwen2.5. The project details the iterative process from v1 to v2, including architecture design, training strategies, and problem-solving ideas, making it an excellent reference for learning multimodal model development.

Tags: Vision-Language Model · VLM · LLaVA · Multimodal AI · CLIP · Qwen · LoRA · Instruction Fine-tuning · Education · Open Source
Published 2026-05-13 14:27 · Recent activity 2026-05-13 14:52 · Estimated read: 7 min

Section 01

Introduction: The Core of an Educational Project to Build Mini-LLaVA from Scratch

This is an educational open-source project where the author builds the Mini-LLaVA vision-language model from scratch, completing training on an RTX 4060 laptop GPU using a combination of CLIP-ViT and Qwen2.5. The project details the iterative process from v1 to v2, including architecture design, training strategies, and problem-solving ideas, providing a clear path and reference for learning multimodal model development.


Section 02

Project Background and Learning Value

In the VLM field, LLaVA is a landmark open-source project, but it is hard to grasp its internal mechanisms by using the existing codebase directly. To address this learning pain point, the project uses a simplified Mini-LLaVA implementation as a vehicle and fully documents the development cycle of "identify problem → analyze cause → iterate and improve". By focusing on the process rather than raw performance, it becomes a distinctive resource for learning multimodal development.


Section 03

Technical Architecture and Model Selection

The project adopts an architecture similar to LLaVA-1.5, streamlined and optimized for consumer hardware:

  • Vision Encoder: CLIP-ViT-B/32, which offers strong general visual representations at a parameter count suited to consumer-grade hardware; a 224×224 input split into 32×32 patches yields a 7×7 grid, so it outputs 49 feature vectors of 768 dimensions each.
  • Language Model: Qwen2.5-0.5B-Instruct, which runs within 8 GB of VRAM and follows instructions well; its embedding dimension of 896 requires alignment via a projection layer.
  • Projection Layer: A learnable MLP that maps CLIP's 768-dimensional features into Qwen's 896-dimensional space; it is the only component trained in the first stage (see the sketch after this list).
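To make the dimension alignment concrete, here is a minimal PyTorch sketch of such a projector. The post specifies only a learnable MLP mapping 768 to 896 dimensions; the two-layer structure, GELU activation, and hidden width below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps CLIP-ViT-B/32 patch features (49 x 768) into the
    Qwen2.5-0.5B embedding space (896-d). Depth, activation, and
    hidden width are illustrative assumptions, not the post's exact spec."""
    def __init__(self, clip_dim: int = 768, llm_dim: int = 896, hidden_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(clip_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, 49, 768) -- a 224x224 image split into
        # 32x32 patches gives a 7x7 = 49 token grid
        return self.net(patch_feats)  # (batch, 49, 896)

projector = VisionProjector()
print(projector(torch.randn(2, 49, 768)).shape)  # torch.Size([2, 49, 896])
```

Because only these few parameters receive gradients in the first stage, the layer can be trained quickly even on a laptop GPU.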

Section 04

Detailed Two-Stage Training Strategy

The project follows a two-stage training paradigm:

  1. Projection Pre-training: Freeze CLIP and Qwen and train only the MLP projection layer on 5000 image-text pairs from Flickr30k with a learning rate of 1e-3. One epoch takes about 7 minutes and reaches a loss of 2.4403. The goal is to align the visual and language embedding spaces.
  2. Instruction Fine-tuning: Fine-tune Qwen's attention layers with LoRA while continuing to train the projection layer alongside the LoRA parameters; mix localized_narratives (long descriptions), aokvqa (reasoning QA), and vqav2 (factual QA) in equal 33% shares; apply response-only label masking, so that only the assistant's answer contributes to the loss, forcing the model to learn to answer questions rather than imitate caption patterns (see the sketch after this list).
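The two stages differ mainly in which parameters receive gradients and how the labels are built. The sketch below, in PyTorch with the Hugging Face peft library, illustrates both ideas; the LoRA rank and alpha, the target module names, and the helper names (setup_stage1, setup_stage2, build_labels, and the clip/qwen/projector modules) are assumptions for illustration, not the post's exact configuration.

```python
import torch
from peft import LoraConfig, get_peft_model  # Hugging Face PEFT

IGNORE_INDEX = -100  # labels set to -100 are ignored by CrossEntropyLoss

# --- Stage 1: train only the projector ---------------------------------
def setup_stage1(clip, qwen, projector):
    for p in clip.parameters():
        p.requires_grad = False   # freeze vision encoder
    for p in qwen.parameters():
        p.requires_grad = False   # freeze language model
    return torch.optim.AdamW(projector.parameters(), lr=1e-3)

# --- Stage 2: LoRA on attention + response-only label masking ----------
def setup_stage2(qwen):
    cfg = LoraConfig(
        r=8, lora_alpha=16,                      # illustrative values
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    # Only the LoRA adapters become trainable here; per the post,
    # the projection layer also stays trainable in stage 2.
    return get_peft_model(qwen, cfg)

def build_labels(prompt_ids: list[int], answer_ids: list[int]):
    """Mask everything before the assistant's answer so that only
    the response tokens contribute to the loss."""
    input_ids = torch.tensor(prompt_ids + answer_ids)
    labels = torch.tensor([IGNORE_INDEX] * len(prompt_ids) + answer_ids)
    return input_ids, labels
```

Setting a label to -100 is the standard way to exclude a token from PyTorch's cross-entropy loss, so the prompt and image tokens shape the context but never the gradient.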

Section 05

Iterative Improvements from v1 to v2 and Effect Verification

The v1 model had a pattern-imitation problem: it tended to generate descriptions resembling Flickr30k captions instead of directly answering questions. The root cause was that the first-stage training objective was language modeling (caption generation) rather than instruction following. v2 improved significantly after instruction fine-tuning: it adjusts the answer format to the question type (open-ended → concise description, attribute → the attribute's value, yes/no → confirmation), and visual QA accuracy rose from 0/1 to 4/5 (80%).


Section 06

Challenges Discovered in Multilingual and Out-of-Distribution (OOD) Testing

  • Multilingual Challenge: The training data is 100% English, and LoRA fine-tuning degraded the model's Korean ability (e.g., when asked what the dog is wearing on its head, it answered "dog" instead of "hat"), underscoring the importance of balanced multilingual data during parameter-efficient fine-tuning (PEFT).
  • OOD Testing: Shown a Pikachu image, the model misclassified it as a giraffe, a systematic failure mode of mapping unfamiliar inputs to the nearest category in the training distribution; an OOD detection mechanism is needed.

Section 07

Summary and Future Improvement Directions

Summary: Although this project is not the most performant VLM, its detailed iteration records and problem analyses give learners a clear path to understanding multimodal models. Limitations: impaired multilingual ability, limited OOD handling, single-image input only, and no mechanism for resuming interrupted training. Future Directions: v3 plans to introduce Korean training data, upgrade to CLIP-ViT-L/14, and add an OOD detection module, among other improvements.