Section 01
Introduction to GRIP: A New Feedback-Based Retrieval Method for Multimodal In-Context Examples
GRIP is a learnable retrieval framework driven by model feedback. Through contrastive training, it learns to identify in-context examples that actually improve in-context learning (ICL) performance, addressing the limitations of traditional similarity-based retrieval in multimodal settings. It consistently outperforms similarity-based retrieval methods on image classification, image captioning, and Visual Question Answering (VQA) tasks, and it also transfers across models.
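
To make the idea of feedback-based contrastive retrieval concrete, here is a minimal sketch of one common way such a training signal could be built. It is not GRIP's actual implementation: this section does not specify the loss, the labeling rule, or the encoder, so everything below is an assumption for illustration. The helper names (split_by_feedback, info_nce_loss), the top-k/bottom-k labeling, the InfoNCE-style loss, and the temperature value are all hypothetical. The feedback score is assumed to be some scalar measuring how much a candidate helps the model when used as an in-context example, such as the log-likelihood of the correct answer.

```python
import numpy as np

def split_by_feedback(feedback_scores, k=1):
    """Rank candidates by model feedback (higher = more helpful).

    Returns (positive_indices, negative_indices): the top-k and bottom-k
    candidates. Assumes 2 * k <= len(feedback_scores) so they do not overlap.
    """
    order = np.argsort(feedback_scores)[::-1]  # descending by feedback
    return order[:k], order[-k:]

def info_nce_loss(query_emb, cand_embs, pos_idx, neg_idx, temperature=0.1):
    """InfoNCE-style contrastive loss over cosine similarities.

    For each positive p: -log( exp(s_p) / (exp(s_p) + sum_n exp(s_n)) ),
    averaged over positives. Minimizing it pulls the query embedding toward
    candidates the model found helpful and away from unhelpful ones.
    """
    q = query_emb / np.linalg.norm(query_emb)
    c = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    sims = (c @ q) / temperature               # scaled cosine similarities
    pos, neg = sims[pos_idx], sims[neg_idx]
    neg_lse = np.logaddexp.reduce(neg)         # log sum_n exp(s_n), stable
    return float(np.mean(np.logaddexp(pos, neg_lse) - pos))

# Toy usage: 4 candidate examples with made-up feedback scores.
feedback = np.array([0.1, 0.9, 0.4, 0.2])
pos_idx, neg_idx = split_by_feedback(feedback, k=1)
cands = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]])
loss = info_nce_loss(np.array([1.0, 0.0]), cands, pos_idx, neg_idx)
```

The point of the sketch is the contrast with similarity-based retrieval: the positives and negatives are chosen by how the model responds to each candidate, not by how close the candidate already is to the query in embedding space. In a real system the embeddings would come from a trainable multimodal encoder and the loss would be backpropagated through it.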