# Large Multimodal Model Paper Repository: A Panoramic View of Visual-Language Model Evolution from CLIP to Qwen3-VL

> An open-source paper list comprehensively organizing the development history of large multimodal models, covering key models and review literature from 2021 to 2026, providing a systematic learning roadmap for researchers and developers.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-02T07:08:33.000Z
- Last activity: 2026-06-02T07:21:19.189Z
- Popularity: 151.8
- Keywords: multimodal models, vision-language models, VLM, CLIP, LLaVA, Qwen-VL, DeepSeek-VL, InternVL, paper list, artificial intelligence, machine learning, computer vision, natural language processing
- Page link: https://www.zingnex.cn/en/forum/thread/clipqwen3-vl
- Canonical: https://www.zingnex.cn/forum/thread/clipqwen3-vl

---

## Introduction: Large Multimodal Model Paper Repository — Panoramic Navigation of VLM Evolution

Awesome-Large-Multimodal-Model is an open-source paper list maintained by youngtboy on GitHub. It systematically organizes the development of Visual-Language Models (VLMs) from 2021 to 2026, covering key models such as CLIP, LLaVA, and Qwen3-VL alongside review literature. The list gives researchers and developers a learning roadmap for tracing how the technology evolved.

## Background: Why Do We Need This Resource List?

VLMs have evolved rapidly from image-text alignment to cross-modal reasoning. With dozens of papers and projects emerging each year, researchers struggle to locate foundational work, spot technical trends, and trace which models build on which. A systematically organized resource list addresses this pain point.

## Project Overview: Structure and Content Organization

The project organizes VLM resources from 2021 to 2026 chronologically. Each entry includes the model's abbreviation, the full paper title, the publication venue (conference or journal), a paper link, and a code repository link where one is available. A separate Survey section lists 5 review articles as an entry point for beginners.
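The entry layout described above can be modeled as a small record type. This is a minimal sketch, assuming hypothetical names throughout: `PaperEntry`, its fields, and the `to_markdown` rendering are illustrative and are not taken from the repository, whose actual markdown layout may differ. The CLIP details shown are the paper's public metadata.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PaperEntry:
    """One list entry. Field names are illustrative, not the repository's schema."""
    abbreviation: str                 # short model name, e.g. "CLIP"
    title: str                        # full paper title
    venue: str                        # conference or journal
    year: int                         # used for chronological ordering
    paper_url: str
    code_url: Optional[str] = None    # None when no code repository is listed

    def to_markdown(self) -> str:
        """Render the entry as a single markdown bullet (assumed format)."""
        line = (
            f"- **{self.abbreviation}**: {self.title} "
            f"({self.venue}) [[paper]({self.paper_url})]"
        )
        if self.code_url:
            line += f" [[code]({self.code_url})]"
        return line


clip = PaperEntry(
    abbreviation="CLIP",
    title="Learning Transferable Visual Models From Natural Language Supervision",
    venue="ICML 2021",
    year=2021,
    paper_url="https://arxiv.org/abs/2103.00020",
    code_url="https://github.com/openai/CLIP",
)
print(clip.to_markdown())
```

Making `code_url` optional mirrors the "if available" wording: an entry without released code simply omits the code link when rendered.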

## Evidence of Technical Evolution: Five Key Stages

1. Foundation Period (2021): CLIP initiated the era of image-text pre-training.
2. Unified Architecture Exploration (2022-2023): BLIP, LLaVA, Qwen-VL, and others explored the instruction tuning paradigm.
3. Scaling and Engineering Optimization (2023-2024): InternVL, DeepSeek-VL, and others pushed the performance boundaries.
4. Specialized Breakthroughs (2024-2025): Vertical-domain applications such as MedVLM-R1 and DeepSeek-OCR.
5. Reasoning Enhancement (2025-present): R1-V and Qwen3-VL introduced reinforcement learning to improve reasoning capabilities.

## Core Conclusions: Key Trends in the VLM Field

1. The open-source ecosystem is thriving: most of the listed projects are open-source, which accelerates the development of the field.
2. Chinese academic strength is on the rise, with models such as Qwen-VL and InternVL performing outstandingly.
3. Technical routes are converging and diverging at the same time: instruction tuning has become standard, while directions such as encoder-free architectures and generative pre-training are still being explored.
4. The paradigm is shifting from "understanding" to "reasoning".

## Usage Recommendations: Guide to Efficiently Using the Repository

Academic researchers can use the list to locate key papers quickly and track new results; industrial developers can use it to evaluate candidate models; beginners can start from the review articles.

Recommendations: read the reviews first to build a macro-level understanding, prioritize projects with open-source code, follow the list year by year to trace how techniques build on one another, and weigh each model's design trade-offs against your application scenario.
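The recommended reading order can be expressed as a sort key. This is only a sketch under stated assumptions: the `Entry` type and `reading_order` function are hypothetical, and the two `Example...` rows are placeholders rather than real papers; only the CLIP and LLaVA repository URLs are real.

```python
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    name: str
    year: int
    is_survey: bool
    code_url: Optional[str]  # None when no code repository is listed


ENTRIES = [
    Entry("LLaVA", 2023, False, "https://github.com/haotian-liu/LLaVA"),
    Entry("ExampleSurvey", 2024, True, None),   # hypothetical placeholder
    Entry("CLIP", 2021, False, "https://github.com/openai/CLIP"),
    Entry("ExampleVLM", 2025, False, None),     # hypothetical placeholder, no code
]


def reading_order(entries: List[Entry]) -> List[Entry]:
    """Surveys first, then models with code before models without,
    oldest first within each group."""
    return sorted(
        entries,
        key=lambda e: (not e.is_survey, e.code_url is None, e.year),
    )


for entry in reading_order(ENTRIES):
    print(entry.year, entry.name)
# Prints:
# 2024 ExampleSurvey
# 2021 CLIP
# 2023 LLaVA
# 2025 ExampleVLM
```

Sorting on a tuple keeps the three recommendations as separate, ordered criteria, so changing the priority (for example, year before code availability) only means reordering the tuple.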
