The striking multimodal capabilities of large models such as GPT-4V and Claude 3 have drawn growing industry attention to vision-language models. However, these top-tier models typically carry enormous parameter counts and high inference costs, making them difficult to deploy on edge devices or in resource-constrained settings.
At the same time, research shows that with carefully designed architectures and training strategies, small-scale models can achieve surprisingly strong multimodal performance. TinyLLaVA Factory is an open-source project that emerged from this trend, providing complete infrastructure for building and training small multimodal models.
The project's flagship model, TinyLLaVA-Phi-2-SigLIP-3.1B (only 3.1 billion parameters), outperforms models with more than twice its parameter count, such as LLaVA-1.5-7B and Qwen-VL-7B, on multiple benchmarks, demonstrating the potential of small-scale models.