# TinyLLaVA Factory: A Modular Training Framework for Small-Scale Multimodal Large Models

> TinyLLaVA Factory is an open-source modular codebase focused on the training and customization of small-scale multimodal large models (LMMs). By supporting various LLM backbones, vision encoders, and connector architectures, this framework enables researchers to customize their own multimodal models with minimal code effort.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-17T06:44:33.000Z
- Last activity: 2026-04-17T07:24:41.043Z
- Popularity: 159.3
- Keywords: multimodal models, vision-language models, small-scale LLMs, TinyLLaVA, Phi-2, SigLIP, model training frameworks, edge deployment
- Page link: https://www.zingnex.cn/en/forum/thread/tinyllava-factory
- Canonical: https://www.zingnex.cn/forum/thread/tinyllava-factory
- Markdown source: floors_fallback

---

## Introduction / Main Floor


## The Miniaturization Trend of Multimodal Models

As large models like GPT-4V and Claude 3 demonstrate striking multimodal capabilities, industry attention to vision-language models continues to rise. However, these top-tier models carry huge parameter counts and high inference costs, making them difficult to deploy on edge devices or in resource-constrained scenarios.

At the same time, research shows that through carefully designed architectures and training strategies, **small-scale models can also achieve surprising multimodal performance**. TinyLLaVA Factory is an open-source project born in response to this trend, providing a complete infrastructure for building and training small multimodal models.

The project's flagship model, TinyLLaVA-Phi-2-SigLIP-3.1B (only 3.1 billion parameters), outperforms models with more than double its parameter count, such as LLaVA-1.5-7B and Qwen-VL-7B, on multiple benchmarks, demonstrating the potential of small-scale models.

## Core Positioning of the Framework

TinyLLaVA Factory is an open-source modular codebase based on PyTorch and HuggingFace, with its design philosophy centered around three core goals:

**Code Simplicity**: a clear implementation structure lowers the barrier to understanding and modifying the code

**Functional Extensibility**: new model components and training strategies are easy to add

**Result Reproducibility**: detailed hyperparameter configurations ensure training results can be reproduced

Unlike many "black-box" training frameworks, TinyLLaVA Factory encourages users to deeply understand the working principles of each component and customize them according to their own needs.

## Supported Model Component Ecosystem

The framework's defining feature is its rich set of component options, which lets users freely combine modules to build customized multimodal models:

## Language Model (LLM) Support

- **OpenELM**: Apple's open-source series of efficient language models
- **TinyLlama**: A lightweight yet high-performance model with 1.1B parameters
- **StableLM**: Stability AI's open language model series
- **Qwen/Qwen2.5**: Alibaba's Tongyi Qianwen series
- **Gemma**: Google's open-source lightweight model
- **Phi-2**: Microsoft Research's efficient small model
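Because the connector must project visual features into whichever LLM's embedding space is chosen, a natural way to organize backbone support is a registry pairing each model with its hidden size. The sketch below is a hypothetical illustration of that idea, not TinyLLaVA Factory's actual API; the dimensions follow the public model cards.

```python
# Hypothetical registry mapping each supported LLM backbone to its hidden
# size, used to pick the connector's output dimension. Illustrative only;
# TinyLLaVA Factory's real configuration system may differ.

LLM_HIDDEN_SIZES = {
    "TinyLlama-1.1B": 2048,
    "StableLM-2-1.6B": 2048,
    "Gemma-2B": 2048,
    "Phi-2": 2560,
    "Qwen2.5-1.5B": 1536,
    "OpenELM-1.1B": 2048,
}

def connector_out_dim(llm_name: str) -> int:
    """Return the embedding dimension the connector must project into."""
    try:
        return LLM_HIDDEN_SIZES[llm_name]
    except KeyError:
        raise ValueError(f"Unsupported LLM backbone: {llm_name}")
```

Swapping backbones then only requires re-sizing (and re-training) the connector, while the vision encoder can stay untouched.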

## Vision Encoder (Vision Tower) Support

- **CLIP**: OpenAI's classic vision-language pre-trained model
- **SigLIP**: Google's CLIP variant trained with a sigmoid loss, with stronger performance on many tasks
- **DINOv2**: Meta's self-supervised learning visual feature extractor
- **CLIP+DINO Combination**: Leveraging the complementary features of the two encoders
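One common way to realize a CLIP+DINO combination is channel-wise fusion: per-patch features from both encoders are concatenated, doubling the feature dimension the connector must handle. The sketch below is a pure-Python stand-in for the tensor operation, with toy dimensions; it is an illustration of the concatenation scheme, not the project's exact implementation.

```python
# Sketch of channel-wise fusion for a CLIP+DINO combination: features from
# the two encoders are concatenated patch by patch along the channel
# dimension. Toy dimensions; a real implementation operates on tensors.

def fuse_features(clip_feats, dino_feats):
    """Concatenate per-patch features from two encoders along the channel dim.

    clip_feats, dino_feats: lists of patches, each patch a list of floats.
    """
    if len(clip_feats) != len(dino_feats):
        raise ValueError("Encoders must yield the same number of patches")
    return [c + d for c, d in zip(clip_feats, dino_feats)]

clip = [[0.1, 0.2], [0.3, 0.4]]   # 2 patches x 2-dim (toy)
dino = [[0.5, 0.6], [0.7, 0.8]]   # 2 patches x 2-dim (toy)
fused = fuse_features(clip, dino)  # 2 patches x 4-dim
```

The complementarity comes from the encoders' training objectives: CLIP features are aligned with text, while DINOv2 features capture fine-grained visual structure.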

## Connector Architecture

The connector is responsible for mapping visual features to the input space of the language model. The framework supports multiple design schemes:

- **MLP**: a simple and efficient multi-layer perceptron projector
- **Q-Former**: Query transformer architecture from BLIP-2
- **Resampler**: Resampler for compressing the number of visual tokens
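The MLP option is the LLaVA-style two-layer projector: visual features of dimension `d_vision` pass through Linear → GELU → Linear to land in the LLM's embedding space of dimension `d_llm`. The following is a minimal pure-Python sketch with toy dimensions, meant to show the data flow rather than a real framework implementation.

```python
import math

# Minimal sketch of a two-layer MLP connector (Linear -> GELU -> Linear)
# mapping a visual feature vector into the LLM embedding space.
# Pure-Python stand-in with toy dimensions; real code uses tensor ops.

def gelu(x):
    # Exact GELU using the error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, w, b):
    # x: vector of length n_in; w: n_out x n_in matrix; b: vector of length n_out.
    return [sum(wi * xi for wi, xi in zip(row, x)) + bj
            for row, bj in zip(w, b)]

def mlp_connector(vis_feat, w1, b1, w2, b2):
    hidden = [gelu(v) for v in linear(vis_feat, w1, b1)]
    return linear(hidden, w2, b2)

# Toy usage: project a 3-dim visual feature into a 2-dim "LLM" space.
d_vis, d_hidden, d_llm = 3, 4, 2
w1 = [[0.1] * d_vis for _ in range(d_hidden)]
b1 = [0.0] * d_hidden
w2 = [[0.1] * d_hidden for _ in range(d_llm)]
b2 = [0.0] * d_llm
out = mlp_connector([1.0, 2.0, 3.0], w1, b1, w2, b2)
```

In practice this projection is applied independently to every visual token, so the token count is unchanged; only the Resampler/Q-Former options compress the number of tokens.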

## Training Strategies

- **Full tuning**: update all parameters
- **Partial tuning**: update only specific layers
- **Frozen**: keep selected components frozen
- **LoRA/QLoRA**: parameter-efficient fine-tuning methods
