Section 01
Introduction: Enhancing Spatial Reasoning in Small-Parameter VLMs
This study explores parameter-efficient fine-tuning methods for lightweight Vision-Language Models (VLMs) with fewer than 1 billion parameters, with the aim of strengthening their 3D spatial understanding and depth estimation as measured on the CV-Bench benchmark. The baseline model is SmolVLM-500M-Instruct, a 500M-parameter model that reaches an initial accuracy of 43.18% on CV-Bench. The research goal is to raise spatial reasoning accuracy well above this baseline while keeping the model lightweight, so that it remains suitable for scenarios such as edge deployment.