Section 01
LiteLVLM: Training-Free Visual Token Pruning for Accelerating Pixel-Level Localization Inference
Abstract: LiteLVLM is a training-free visual token pruning method based on reversed CLIP visual-text similarity. It achieves a 2.2x inference speedup and a 2.3x reduction in memory use while retaining 90% of the original performance, offering a new route to efficient pixel-level localization in large vision-language models.
Core Idea: Reverse CLIP's visual-text similarity computation to score visual tokens, then strategically retain the tokens most critical for localization. The method plugs into existing pre-trained models without any additional training, balancing efficiency against performance.
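To make the core idea concrete, here is a minimal sketch of similarity-based token pruning. Everything here is an assumption for illustration: the function name `prune_visual_tokens`, the use of cosine similarity as the score, the top-k selection rule, and the `keep_ratio` value are all hypothetical, since the source does not specify how LiteLVLM's "reverse similarity" criterion ranks tokens.

```python
import numpy as np

def prune_visual_tokens(visual_tokens, text_embedding, keep_ratio=0.45):
    """Hypothetical sketch: score each visual token against the text
    embedding with cosine similarity and keep only the top fraction.
    (LiteLVLM's actual reverse-similarity criterion may differ.)"""
    # Normalize so dot products become cosine similarities.
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    t = text_embedding / np.linalg.norm(text_embedding)
    scores = v @ t                      # per-token similarity to the text
    k = max(1, int(len(scores) * keep_ratio))
    keep = np.argsort(-scores)[:k]     # indices of the k highest-scoring tokens
    return np.sort(keep)               # restore original token order

rng = np.random.default_rng(0)
tokens = rng.normal(size=(576, 64))    # e.g. a 24x24 patch grid of 64-d tokens
text = rng.normal(size=64)             # pooled text embedding
kept = prune_visual_tokens(tokens, text, keep_ratio=0.45)
print(len(kept))  # 259 of 576 tokens survive
```

Dropping roughly half the visual tokens shrinks the attention and KV-cache cost of every subsequent decoder step, which is the mechanism behind the reported speedup and memory savings; the pruning itself is a single scoring pass, so no retraining is needed.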