Section 01
[Introduction] Xuanwu VL-2B: An Industrial-Grade Multimodal Foundation Model for Content Ecosystems
Xuanwu VL-2B is an industrial-grade multimodal foundation model for content ecosystems. It uses a compact architecture of InternViT-300M + MLP + Qwen3 1.7B (about 2B parameters in total). An iterative data filtering mechanism and three-stage progressive training let it balance business alignment, visual perception, and general capabilities. In adversarial OCR scenarios its recall reaches 82.82%, surpassing Gemini-2.5-Pro; on business audit tasks its average recall is 94.38%; and on the OpenCompass benchmark its general multimodal capabilities exceed those of similar models. These results are achieved while balancing deployment cost and efficiency.
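The recall figures above can be read with a small worked example. The sketch below assumes the standard definition of recall, TP / (TP + FN), and assumes that the "average recall" is an unweighted mean over tasks; the source does not state how the average is computed. The task names and counts are hypothetical and are not results from the model.

```python
from dataclasses import dataclass


@dataclass
class TaskCounts:
    """Confusion counts for one audit task (hypothetical structure)."""
    true_positives: int   # violating samples the model flagged
    false_negatives: int  # violating samples the model missed


def recall(counts: TaskCounts) -> float:
    """Standard recall: TP / (TP + FN)."""
    total = counts.true_positives + counts.false_negatives
    if total == 0:
        raise ValueError("recall is undefined when there are no positive samples")
    return counts.true_positives / total


def macro_average_recall(tasks: dict[str, TaskCounts]) -> float:
    """Unweighted mean of per-task recall (an assumed averaging scheme)."""
    if not tasks:
        raise ValueError("at least one task is required")
    return sum(recall(c) for c in tasks.values()) / len(tasks)


# Hypothetical counts, for illustration only.
example_tasks = {
    "task_a": TaskCounts(true_positives=90, false_negatives=10),
    "task_b": TaskCounts(true_positives=45, false_negatives=5),
    "task_c": TaskCounts(true_positives=80, false_negatives=20),
}

if __name__ == "__main__":
    for name, counts in example_tasks.items():
        print(f"{name}: recall = {recall(counts):.2%}")
    print(f"average recall = {macro_average_recall(example_tasks):.2%}")
```

An unweighted mean treats every task equally regardless of sample count. If the reported average were instead weighted by the number of positive samples per task, the counts would be pooled before dividing.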