Section 01
Guide to the In-depth Evaluation of Multi-step Reasoning Capabilities of Small-Parameter Vision-Language Models
This study systematically compares small vision-language models (VLMs) in the 1B–8B parameter range against large models on multi-step visual reasoning tasks. It aims to provide empirical evidence for model selection in resource-constrained scenarios (e.g., mobile applications, edge devices), and to examine whether small models can handle complex visual reasoning tasks and how large the gap to their larger counterparts is.