Zing Forum

TinyLLaVA Factory: A Modular Training Framework for Small-Scale Multimodal Large Models

Tags: multimodal models, vision-language models, small-scale LLMs, TinyLLaVA, Phi-2, SigLIP, model training frameworks, edge deployment
Published 2026-04-17 14:44 · Recent activity 2026-04-17 15:24 · Estimated read 6 min

Section 01

Introduction (Main Floor)

TinyLLaVA Factory is an open-source modular codebase focused on the training and customization of small-scale large multimodal models (LMMs). By supporting various LLM backbones, vision encoders, and connector architectures, the framework enables researchers to customize their own multimodal models with minimal coding effort.

Section 02

The Miniaturization Trend of Multimodal Models

With large models such as GPT-4V and Claude 3 demonstrating stunning multimodal capabilities, industry attention to vision-language models has continued to rise. However, these top-tier models typically have huge parameter counts and high inference costs, making them difficult to deploy on edge devices or in other resource-constrained scenarios.

At the same time, research shows that through carefully designed architectures and training strategies, small-scale models can also achieve surprising multimodal performance. TinyLLaVA Factory is an open-source project born in response to this trend, providing a complete infrastructure for building and training small multimodal models.

The project's flagship model, TinyLLaVA-Phi-2-SigLIP-3.1B (only 3.1 billion parameters), outperforms models with more than double its parameter count, such as LLaVA-1.5-7B and Qwen-VL-7B, on multiple benchmarks, demonstrating the strong potential of small-scale models.

Section 03

Core Positioning of the Framework

TinyLLaVA Factory is an open-source modular codebase based on PyTorch and HuggingFace, with its design philosophy centered around three core goals:

Code Simplicity: A clear implementation structure that lowers the barrier to understanding and modification

Function Extensibility: New model components and training strategies are easy to add

Result Reproducibility: Detailed hyperparameter configurations ensure that training results can be reproduced consistently

Unlike many "black-box" training frameworks, TinyLLaVA Factory encourages users to deeply understand the working principles of each component and customize them according to their own needs.

Section 04

Supported Model Component Ecosystem

The framework's biggest feature lies in its rich component options, allowing users to freely combine different modules to build customized multimodal models:
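This mix-and-match composition can be pictured as a small factory-registry pattern. The sketch below is purely illustrative: the registry names, decorator, and placeholder return values are invented for this example and are not TinyLLaVA Factory's actual API.

```python
# Hypothetical sketch of component registries for mix-and-match model
# assembly. Names and structure are illustrative only -- NOT the real API.

LLM_REGISTRY = {}
VISION_REGISTRY = {}
CONNECTOR_REGISTRY = {}

def register(registry, name):
    """Decorator that records a component factory under a string key."""
    def wrap(fn):
        registry[name] = fn
        return fn
    return wrap

@register(LLM_REGISTRY, "phi-2")
def build_phi2():
    return "phi-2 backbone"          # placeholder for a real model object

@register(VISION_REGISTRY, "siglip")
def build_siglip():
    return "siglip vision tower"

@register(CONNECTOR_REGISTRY, "mlp")
def build_mlp():
    return "mlp connector"

def build_model(llm, vision_tower, connector):
    """Assemble a multimodal model from named components."""
    return {
        "llm": LLM_REGISTRY[llm](),
        "vision_tower": VISION_REGISTRY[vision_tower](),
        "connector": CONNECTOR_REGISTRY[connector](),
    }

model = build_model("phi-2", "siglip", "mlp")
```

Swapping one string key for another (e.g., `"phi-2"` for `"gemma"`) is all it takes to change a component, which is the spirit of the modular design described above.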

Section 05

Language Model (LLM) Support

  • OpenELM: Apple's open-source series of efficient language models
  • TinyLlama: A lightweight yet high-performance model with 1.1B parameters
  • StableLM: Stability AI's open language model series
  • Qwen/Qwen2.5: Alibaba's Tongyi Qianwen series
  • Gemma: Google's open-source lightweight model
  • Phi-2: Microsoft Research's efficient small model
Section 06

Vision Encoder (Vision Tower) Support

  • CLIP: OpenAI's classic vision-language pre-trained model
  • SigLIP: Google's improved vision encoder with better performance in multiple tasks
  • DINOv2: Meta's self-supervised learning visual feature extractor
  • CLIP+DINO Combination: Leveraging the complementary features of the two encoders
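One simple way to realize the CLIP+DINO combination is to encode the same image with both towers and concatenate their per-token features along the channel dimension before the connector. The sketch below uses random tensors with made-up shapes (576 patch tokens, 1024-dim CLIP features, 1536-dim DINOv2 features) purely to illustrate the idea; it is not the project's actual code.

```python
import torch

# Illustrative only: fuse two vision towers' outputs by channel concatenation.
batch, num_tokens = 2, 576
clip_feats = torch.randn(batch, num_tokens, 1024)   # stand-in for CLIP output
dino_feats = torch.randn(batch, num_tokens, 1536)   # stand-in for DINOv2 output

# Both encoders see the same image, so token counts match and the feature
# vectors can simply be concatenated along the last dimension.
fused = torch.cat([clip_feats, dino_feats], dim=-1)
print(fused.shape)  # torch.Size([2, 576, 2560])
```

The connector then only needs a wider input dimension (here 1024 + 1536 = 2560) to consume the complementary semantic (CLIP) and self-supervised (DINOv2) features together.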
Section 07

Connector Architecture

The connector is responsible for mapping visual features to the input space of the language model. The framework supports multiple design schemes:

  • MLP: Simple and efficient multi-layer perceptron
  • Q-Former: Query transformer architecture from BLIP-2
  • Resampler: Resampler for compressing the number of visual tokens
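The MLP variant is the simplest of these: a two-layer perceptron, in the style popularized by LLaVA-1.5, that projects each visual token into the LLM's embedding space. The dimensions below (SigLIP-like 1152-dim input, Phi-2-like 2560-dim output, 729 = 27×27 patch tokens) are illustrative assumptions, not values taken from the project's configs.

```python
import torch
import torch.nn as nn

class MLPConnector(nn.Module):
    """Two-layer MLP that maps vision features into the LLM embedding space.
    Dimensions are illustrative (e.g., SigLIP-like 1152 -> Phi-2-like 2560)."""
    def __init__(self, vision_dim=1152, llm_dim=2560):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats):
        # vision_feats: (batch, num_tokens, vision_dim)
        return self.proj(vision_feats)

connector = MLPConnector()
tokens = torch.randn(2, 729, 1152)   # e.g., a 27x27 patch grid per image
out = connector(tokens)
print(out.shape)  # torch.Size([2, 729, 2560])
```

Note that an MLP keeps the token count unchanged; Q-Former and resampler designs instead compress the visual tokens to a smaller, fixed number of queries, trading detail for shorter LLM input sequences.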
Section 08

Training Strategies

  • Full tuning: update all parameters of a component
  • Partial tuning: update only specific layers
  • Frozen: keep a component's weights fixed
  • LoRA/QLoRA: parameter-efficient fine-tuning methods
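In plain PyTorch, the full/partial/frozen recipes above boil down to toggling `requires_grad` on each component's parameters. The sketch below is a minimal illustration using toy `nn.Linear` stand-ins for the three components; it is not TinyLLaVA Factory's actual API.

```python
import torch.nn as nn

# Toy stand-ins for the three components of a multimodal model.
model = nn.ModuleDict({
    "vision_tower": nn.Linear(8, 8),
    "connector": nn.Linear(8, 8),
    "llm": nn.Linear(8, 8),
})

def set_trainable(module, trainable):
    """Freeze or unfreeze every parameter of a component."""
    for p in module.parameters():
        p.requires_grad = trainable

# A typical connector-pretraining-style recipe: freeze the vision tower
# and the LLM, train only the connector.
set_trainable(model["vision_tower"], False)
set_trainable(model["llm"], False)
set_trainable(model["connector"], True)

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['connector.weight', 'connector.bias']
```

LoRA/QLoRA go one step further: instead of updating a component's own weights, they freeze them and train small low-rank adapter matrices injected alongside, which drastically reduces trainable parameters and memory use.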