Section 01
Introduction: Kimi-VL — A Compact Yet Powerful Multimodal Vision-Language Model
Moonshot AI's open-source Kimi-VL adopts a Mixture-of-Experts (MoE) architecture with 16B total parameters, of which only about 3B are activated during inference. It excels at 128K-token long-context processing, multimodal reasoning, and agent tasks. Its Thinking variant outperforms 70B-scale open-source models on mathematical-reasoning benchmarks and even surpasses GPT-4o in some scenarios, offering a new way to balance efficiency and performance in multimodal models.