Section 01
Introduction: Core of the Educational Project to Build Mini-LLaVA from Scratch
This is an educational open-source project in which the author builds the Mini-LLaVA vision-language model from scratch, pairing a CLIP-ViT vision encoder with the Qwen2.5 language model and completing training on a single RTX 4060 laptop GPU. The project documents the iterative process from v1 to v2, covering architecture design, training strategies, and debugging approaches, and offers a clear path and reference for anyone learning multimodal model development.
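The CLIP-ViT-plus-Qwen2.5 combination mentioned above follows the standard LLaVA recipe: patch features from the vision encoder are projected into the language model's embedding space and prepended to the text tokens. The sketch below illustrates only the data flow with NumPy stand-ins; the dimensions (768 for CLIP-ViT, 896 for Qwen2.5-0.5B), the patch count, and the single-linear projector are assumptions for illustration, not the project's actual configuration.

```python
import numpy as np

# Hypothetical dimensions (assumptions, not the project's real config):
CLIP_DIM = 768      # CLIP-ViT patch-feature size
LLM_DIM = 896       # Qwen2.5-0.5B hidden size
NUM_PATCHES = 196   # 14x14 patches for a 224x224 input image

rng = np.random.default_rng(0)

# Stand-in for CLIP-ViT output: one feature vector per image patch.
image_features = rng.standard_normal((NUM_PATCHES, CLIP_DIM))

# The projector maps vision features into the LLM embedding space.
# A single linear layer is used here; real projectors are often small MLPs.
W = rng.standard_normal((CLIP_DIM, LLM_DIM)) * 0.02
b = np.zeros(LLM_DIM)
visual_tokens = image_features @ W + b          # shape (196, 896)

# Stand-in for text-token embeddings from the LLM's embedding table.
text_embeds = rng.standard_normal((10, LLM_DIM))

# Multimodal input: visual tokens prepended to text tokens, then fed
# to the language model as ordinary input embeddings.
inputs_embeds = np.concatenate([visual_tokens, text_embeds], axis=0)
print(inputs_embeds.shape)  # (206, 896)
```

In this scheme, only the projector (and later, parts of the LLM) needs to be trained; the vision encoder can stay frozen, which is what makes training feasible on a single consumer GPU.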