Section 01
【Main Floor】LiteLVLM: Guide to Training-Free Efficient Token Pruning for Vision-Language Models
The LiteLVLM technique, proposed by the Computer Vision Research Group at Sejong University, achieves efficient token pruning without any training or fine-tuning by reversing CLIP's vision-text similarity mechanism. It retains approximately 90% of baseline performance on pixel-level localization tasks while delivering a 2.2x inference speedup and a 2.3x reduction in memory usage, offering a new path toward efficient deployment of large vision-language models.
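The excerpt above does not spell out LiteLVLM's exact scoring rule, but the general idea of text-guided, training-free token pruning can be sketched as follows: score each visual token by its CLIP-space similarity to the text prompt, then keep only the most relevant ones before they enter the language model. The function name `prune_visual_tokens` and the `keep_ratio` default are hypothetical illustrations, not the paper's API; a minimal sketch, assuming both modalities are already projected into CLIP's joint embedding space:

```python
import torch
import torch.nn.functional as F

def prune_visual_tokens(visual_tokens: torch.Tensor,
                        text_embedding: torch.Tensor,
                        keep_ratio: float = 0.45) -> torch.Tensor:
    """Keep only the visual tokens most relevant to the text query.

    visual_tokens:  (N, D) patch/token embeddings in CLIP's joint space.
    text_embedding: (D,) pooled CLIP text embedding for the prompt.
    keep_ratio:     fraction of tokens to retain (hypothetical default).
    """
    # Cosine similarity between each visual token and the text query.
    v = F.normalize(visual_tokens, dim=-1)
    t = F.normalize(text_embedding, dim=-1)
    scores = v @ t                                   # (N,)

    # Retain the top-k most text-relevant tokens; re-sort the kept
    # indices so the original spatial order is preserved.
    k = max(1, int(keep_ratio * visual_tokens.shape[0]))
    top_idx = scores.topk(k).indices.sort().values
    return visual_tokens[top_idx]

# Example: prune 576 image tokens down to ~45% before LLM decoding.
tokens = torch.randn(576, 512)
query = torch.randn(512)
kept = prune_visual_tokens(tokens, query)
print(kept.shape)  # torch.Size([259, 512])
```

Because the similarity scoring reuses frozen CLIP embeddings, this kind of pruning needs no gradient updates, which is what makes the approach training-free; the reported 2.2x speedup would come from the LLM attending over far fewer visual tokens.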