Section 01
[Introduction] Multimodal Image Retrieval: Comparative Study and Optimization of CLIP and BLIP on Flickr30K
This project uses the Flickr30K dataset to systematically compare the image-text retrieval performance of two representative multimodal models, CLIP and BLIP. It analyzes model failure cases and interpretability in depth, and improves retrieval performance through fine-tuning. The study covers dataset characteristics, architectural differences between the two models, experimental design, key findings, and practical applications, with the aim of providing reproducible benchmarks and insights for multimodal retrieval.
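To make "retrieval performance" concrete, the sketch below shows one common way such a comparison can be scored: Recall@K over a query-by-candidate similarity matrix. This introduction does not name the metric the study uses, so Recall@K is an assumption here, and the function name `recall_at_k` and the toy scores are hypothetical. The sketch also assumes one correct candidate per query, which matches text-to-image retrieval, where each caption belongs to exactly one image.

```python
import numpy as np


def recall_at_k(similarity, gt_index, k):
    """Fraction of queries whose correct candidate appears in the top-k scores.

    similarity: (num_queries, num_candidates) scores, higher means more similar.
    gt_index:   (num_queries,) index of the single correct candidate per query.
    Note: a hypothetical helper for illustration, not the study's own code.
    """
    similarity = np.asarray(similarity, dtype=float)
    gt_index = np.asarray(gt_index)
    # Sort candidates by descending score and keep the first k per query.
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    # A query counts as a hit if its ground-truth index is among those k.
    hits = (top_k == gt_index[:, None]).any(axis=1)
    return float(hits.mean())


# Toy scores for 3 text queries against 3 images (made-up numbers).
toy_similarity = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.7, 0.4],
]
toy_ground_truth = [0, 1, 2]

for k in (1, 2):
    print(f"R@{k} = {recall_at_k(toy_similarity, toy_ground_truth, k):.3f}")
```

In practice the similarity matrix would come from the model under test, for example cosine similarity between normalized text and image embeddings. The same function can then score CLIP and BLIP on identical queries.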