Section 01
DeepScan: Guide to the Training-Free Visual Reasoning Enhancement Framework for Large Vision-Language Models
DeepScan is a training-free framework for improving the performance of large vision-language models (LVLMs) on fine-grained visual reasoning tasks. It emulates the human bottom-up reasoning process through three core stages: hierarchical scanning, refocusing, and evidence-enhanced reasoning. Experiments show that the framework yields substantial gains: on the V* benchmark, DeepScan with Qwen2.5-VL-7B as the backbone model reaches 90.6% overall accuracy, an improvement of 16.3% over the original model.
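The three-stage loop described above can be sketched as a simple pipeline. This is a minimal illustration, not the authors' implementation: the `Region` class, the function names, and the stub return values are all assumptions made for clarity, and a real system would drive an LVLM at each stage rather than operate on placeholder data.

```python
from dataclasses import dataclass

# Hypothetical sketch of DeepScan's three stages; names and data
# structures are illustrative assumptions, not the framework's API.

@dataclass
class Region:
    box: tuple    # (x0, y0, x1, y1) crop in image coordinates
    score: float  # estimated relevance of this region to the question

def hierarchical_scan(image, question):
    """Stage 1: coarse-to-fine scan proposing candidate regions.
    Assumption: returns regions scored by relevance to the question."""
    return [Region(box=(0, 0, 64, 64), score=0.9),
            Region(box=(64, 0, 128, 64), score=0.4)]

def refocus(regions, top_k=1):
    """Stage 2: narrow attention to the most promising regions."""
    return sorted(regions, key=lambda r: r.score, reverse=True)[:top_k]

def evidence_enhanced_reasoning(image, question, evidence):
    """Stage 3: answer with the selected crops as extra context.
    Assumption: a real implementation would prompt the LVLM with the
    crops; this stub only records which boxes were used as evidence."""
    return {"answer": "<lvlm answer>",
            "evidence_boxes": [r.box for r in evidence]}

def deepscan(image, question):
    regions = hierarchical_scan(image, question)
    evidence = refocus(regions)
    return evidence_enhanced_reasoning(image, question, evidence)
```

Because the framework is training-free, the pipeline wraps a frozen backbone model; only the scanning, region selection, and prompt construction differ from a plain single-pass query.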