Zing Forum

Large Multimodal Model Paper Repository: A Panoramic View of Visual-Language Model Evolution from CLIP to Qwen3-VL

An open-source paper list comprehensively organizing the development history of large multimodal models, covering key models and review literature from 2021 to 2026, providing a systematic learning roadmap for researchers and developers.

Tags: multimodal models, vision-language models, VLM, CLIP, LLaVA, Qwen-VL, DeepSeek-VL, InternVL, paper list, artificial intelligence
Published 2026-06-02 15:08 | Recent activity 2026-06-02 15:21 | Estimated read 4 min

Section 01

Introduction: Large Multimodal Model Paper Repository — Panoramic Navigation of VLM Evolution

Awesome-Large-Multimodal-Model, an open-source project maintained by youngtboy on GitHub, is a paper list that systematically organizes the development of Visual-Language Models (VLMs) from 2021 to 2026. It covers key models such as CLIP, LLaVA, and Qwen3-VL, along with review literature, and gives researchers and developers a learning roadmap for tracing how the technology evolved.

Section 02

Background: Why Do We Need This Resource List?

VLMs have evolved rapidly from image-text alignment to cross-modal reasoning, but the dozens of papers and projects that appear each year make it hard for researchers to identify foundational work, technical trends, and which models build on which. A systematically organized resource list addresses this pain point.

Section 03

Project Overview: Structure and Content Organization

The project organizes VLM resources from 2021 to 2026 chronologically. Each entry includes the model's abbreviation, the paper's full title, the publication venue (conference or journal), a paper link, and a code repository link when one is available. A separate Survey section lists 5 review articles as an entry point for beginners.
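The entry layout described above can be pictured as a small record type. This is only a sketch: the class name `PaperEntry`, its field names, the separate `year` field, and the `to_line` rendering are assumptions made for illustration, not the repository's actual format. The sample values are the well-known CLIP paper's title, venue, and links.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PaperEntry:
    """One row of the paper list (field names are illustrative assumptions)."""
    abbreviation: str
    title: str
    venue: str
    year: int
    paper_url: str
    code_url: Optional[str] = None  # None when no code repository is listed

    def has_code(self) -> bool:
        return self.code_url is not None


def to_line(entry: PaperEntry) -> str:
    """Render an entry as one plain-text line (hypothetical layout)."""
    code = entry.code_url if entry.has_code() else "no code listed"
    return (
        f"[{entry.year}] {entry.abbreviation}: {entry.title} "
        f"({entry.venue}) | {entry.paper_url} | {code}"
    )


clip = PaperEntry(
    abbreviation="CLIP",
    title="Learning Transferable Visual Models From Natural Language Supervision",
    venue="ICML",
    year=2021,
    paper_url="https://arxiv.org/abs/2103.00020",
    code_url="https://github.com/openai/CLIP",
)
print(to_line(clip))
```

Making `code_url` optional mirrors the "if available" wording: an entry without a repository is still a valid entry.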

Section 04

Technical Evolution: Five Key Stages

  1. Foundation Period (2021): CLIP initiated the era of image-text pre-training.
  2. Unified Architecture Exploration (2022-2023): BLIP, LLaVA, Qwen-VL, and others explored the instruction tuning paradigm.
  3. Scaling and Engineering Optimization (2023-2024): InternVL, DeepSeek-VL, and others pushed the performance boundaries.
  4. Specialized Breakthroughs (2024-2025): Vertical domain applications such as MedVLM-R1 and DeepSeek-OCR.
  5. Reasoning Enhancement (2025-present): R1-V and Qwen3-VL introduced reinforcement learning to improve reasoning capabilities.


Section 05

Core Conclusions: Key Trends in the VLM Field

  1. The open-source ecosystem is thriving: most projects are open-source, which accelerates the development of the field.
  2. Chinese academic strength is on the rise, with models such as Qwen-VL and InternVL performing outstandingly.
  3. Technical routes are converging and diverging at the same time: instruction tuning has become standard, while explorations such as encoder-free designs and generative pre-training continue.
  4. The paradigm is shifting from "understanding" to "reasoning".


Section 06

Usage Recommendations: Guide to Efficiently Using the Repository

Academic researchers can quickly locate key papers and track new results; industry developers can use the list to compare candidate models; beginners can start with the review articles. Recommendations: read the reviews first to build a high-level understanding, prioritize projects with open-source code, follow the entries year by year to trace how techniques were inherited, and weigh each model's design trade-offs against your application scenario.
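The recommendations above amount to a sort order: surveys first, then models year by year, with open-source projects ahead within each year. A minimal sketch of that ordering follows. The records, their keys (`name`, `year`, `kind`, `code`), and the helper names are all hypothetical placeholders, not data or tooling from the repository.

```python
from collections import defaultdict

# Hypothetical records; names and values are placeholders, not repository data.
entries = [
    {"name": "ModelB", "year": 2023, "kind": "model", "code": True},
    {"name": "SurveyA", "year": 2024, "kind": "survey", "code": False},
    {"name": "ModelA", "year": 2021, "kind": "model", "code": True},
    {"name": "ModelC", "year": 2023, "kind": "model", "code": False},
]


def reading_order(items):
    """Surveys first, then models by year; open-source models lead within a year."""
    def key(item):
        # False sorts before True, so surveys and entries with code come first.
        return (item["kind"] != "survey", item["year"], not item["code"], item["name"])
    return sorted(items, key=key)


def models_by_year(items):
    """Group model names by year to make inheritance across years easier to follow."""
    groups = defaultdict(list)
    for item in items:
        if item["kind"] == "model":
            groups[item["year"]].append(item["name"])
    return dict(sorted(groups.items()))


ordered = [e["name"] for e in reading_order(entries)]
print(ordered)
print(models_by_year(entries))
```

The application-scenario trade-off in the last recommendation is deliberately left out, since it depends on requirements that a sort key cannot capture.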