
GenLIP: Teaching ViT to Speak—Generative Language-Image Pre-training for Multimodal Large Models

GenLIP is a minimalist generative pre-training framework that trains a Vision Transformer (ViT) to predict language tokens directly from visual tokens via a standard language modeling objective, without contrastive learning or an additional text decoder. Trained on only 8B samples, it matches strong baselines and excels on detail-sensitive tasks such as OCR and chart understanding.

Multimodal Large Models · Vision Transformer · Generative Pre-training · Visual Encoder · Autoregressive Model · CLIP · MLLM · Image Understanding
Published 2026-05-02 01:51 · Recent activity 2026-05-04 10:53 · Estimated read 8 min

Section 01

GenLIP: Guide to the Minimalist Generative Vision-Language Pre-training Framework

GenLIP is a minimalist generative pre-training framework for multimodal large models. Its core idea is to train a Vision Transformer (ViT) to predict language tokens directly from visual tokens using a standard language modeling objective, without contrastive learning or an additional text decoder. Trained on only 8B samples, it matches strong baselines and excels on detail-sensitive tasks such as OCR and chart understanding.


Section 02

Paradigm Dilemma in Multimodal Pre-training

Multimodal Large Language Models (MLLMs) rely on visual encoder pre-training, which faces a dilemma: traditional contrastive learning (e.g., CLIP) requires carefully constructed batches, hard negative handling, and an independent text encoder, while existing generative methods have complex structures that need additional text decoders or special training objectives. GenLIP breaks this impasse with a minimalist generative pre-training framework.


Section 03

Core Innovations of GenLIP

Direct Language Token Prediction

  1. Split images into visual tokens (patch embedding)
  2. Feed visual tokens into a standard Transformer
  3. Objective: predict the next language token of the corresponding text

No contrastive-learning batch construction and no additional text decoder: a single unified Transformer models both visual and language tokens.
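
The three steps above amount to a single decoder-only language model whose input sequence starts with projected image patches. Below is a minimal, illustrative PyTorch sketch of that objective; the class and module names (GenerativeVisionLM, patch_embed, etc.) are assumptions, positional embeddings are omitted, and the purely causal mask over image patches is the simplest possible choice rather than the paper's confirmed design.

```python
import torch
import torch.nn.functional as F
from torch import nn


class GenerativeVisionLM(nn.Module):
    """Minimal sketch of a GenLIP-style model: one Transformer over visual + text tokens."""

    def __init__(self, patch_dim, d_model, vocab_size, n_layers=12, n_heads=12):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, d_model)      # flattened patches -> visual tokens
        self.word_embed = nn.Embedding(vocab_size, d_model)   # text tokens
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patches, text_ids):
        # patches: (B, N_img, patch_dim) flattened image patches; text_ids: (B, N_txt)
        vis = self.patch_embed(patches)
        txt = self.word_embed(text_ids)
        x = torch.cat([vis, txt], dim=1)                      # one shared token sequence
        # Simple causal mask; the real model may attend bidirectionally over image patches.
        seq_len = x.size(1)
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf"), device=x.device), 1)
        h = self.transformer(x, mask=mask)
        # Each position predicts the next token; the loss covers text positions only.
        logits = self.lm_head(h[:, vis.size(1) - 1 : -1])     # predictions for every text token
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)), text_ids.reshape(-1))
```

A forward pass returns the language modeling loss directly; no text encoder, similarity matrix, or contrastive batch logic is involved.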

Alignment with the Autoregressive Essence of LLMs

  • Seamless integration: Pre-trained ViT can be directly connected to autoregressive LLMs without adaptation layers
  • Consistent behavior: Shares the "predict next token" inductive bias
  • Simplified architecture: Single Transformer reduces complexity
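
Because GenLIP's visual tokens already live in the language model's embedding space, the "seamless integration" above can be as direct as prepending them to the prompt embeddings. The sketch below assumes a Hugging Face-style causal LM interface; `vit`, `llm`, and `tokenizer` are hypothetical placeholders, not a published API.

```python
import torch

def caption_with_genlip(vit, llm, tokenizer, pixel_values, prompt="Describe the image."):
    """Prepend GenLIP visual tokens to an autoregressive LLM's input embeddings.

    `vit`, `llm`, and `tokenizer` are hypothetical placeholders: a pre-trained
    GenLIP encoder and any Hugging Face-style causal LM. Note that no adapter
    or projection MLP sits between the two models.
    """
    image_tokens = vit(pixel_values)                               # (B, N_img, d_model)
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids  # (1, N_txt)
    prompt_emb = llm.get_input_embeddings()(prompt_ids)            # (1, N_txt, d_model)
    inputs_embeds = torch.cat(
        [image_tokens, prompt_emb.expand(image_tokens.size(0), -1, -1)], dim=1
    )
    return llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=64)
```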

Section 04

Training Efficiency and Multi-Resolution Optimization

Data Efficiency with 8B Samples

GenLIP is trained on 8 billion samples, less data than most leading models use, yet it matches strong baselines. The reasons include the high information density of the generative objective, the absence of a contrastive batch bottleneck, and direct optimization of the downstream output distribution.

Multi-Resolution Continual Pre-training

After basic pre-training, continual training with native aspect ratio images at multiple resolutions significantly improves performance on detail-sensitive tasks:

  • OCR: Accurately recognize text in images
  • Chart understanding: Parse complex charts
  • Fine-grained visual understanding: Capture tiny details
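
One way to prepare native-aspect-ratio inputs at varying resolutions is to snap each image to the nearest patch-aligned size under a token budget. The helper below is an illustrative sketch of that idea, not GenLIP's actual preprocessing code; the token budget and patch size are assumed values.

```python
import math

def native_resolution_size(width, height, patch=16, max_tokens=1024):
    """Pick a patch-aligned size that keeps the native aspect ratio.

    Illustrative sketch (not GenLIP's actual preprocessing): scale the image
    down, if necessary, so the patch grid stays within `max_tokens`, then snap
    each side to a multiple of the patch size.
    """
    scale = min(1.0, math.sqrt(max_tokens * patch * patch / (width * height)))
    new_w = max(patch, math.floor(width * scale / patch) * patch)
    new_h = max(patch, math.floor(height * scale / patch) * patch)
    return new_w, new_h

# Example: a 1280x960 chart with 16x16 patches and a 1024-token budget
# resizes to 576x432, i.e. a 36x27 grid of 972 visual tokens.
print(native_resolution_size(1280, 960))   # (576, 432)
```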

Section 05

Performance Evaluation and Comparison with CLIP Paradigm

Multimodal Benchmark Testing

GenLIP matches or surpasses strong baselines trained on more data on tasks such as visual question answering and image captioning. The multi-resolution version has a clear advantage on fine-grained tasks, and performance improves predictably with scale.

Comparison with CLIP

| Dimension          | CLIP-style Contrastive Learning                | GenLIP Generative Pre-training   |
|--------------------|------------------------------------------------|----------------------------------|
| Architecture       | Dual encoders (visual + text)                  | Single Transformer               |
| Training Objective | Contrastive loss                               | Language modeling loss           |
| Batch Construction | Carefully constructed positive/negative pairs  | No special batch construction    |
| Text Encoder       | Trained independently                          | Shares the same Transformer      |
| Alignment with LLM | Needs additional adaptation                    | Natively aligned (autoregressive)|
| Data Efficiency    | Usually requires large amounts of data         | Competitive with only 8B samples |
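
To make the "Training Objective" and "Batch Construction" rows concrete, here is an illustrative side-by-side sketch of the two loss styles; it is simplified (e.g., CLIP's learnable temperature and GenLIP's exact loss details are omitted).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE: every other sample in the batch acts as a
    negative, so the batch itself must be constructed carefully."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def generative_lm_loss(text_logits, text_ids):
    """GenLIP-style objective: plain per-sample next-token prediction, no negatives."""
    return F.cross_entropy(text_logits[:, :-1].reshape(-1, text_logits.size(-1)),
                           text_ids[:, 1:].reshape(-1))
```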

Section 06

Technical Implementation Details of GenLIP

Visual Tokenization

  • Patch size: 14×14 or 16×16 pixels
  • Position encoding: 2D sine-cosine or learned positional embeddings
  • Special token: [CLS] for global image representation
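
For the "2D sine-cosine" option, the standard recipe splits the embedding channels between row and column coordinates. The sketch below is that generic recipe (as popularized by MAE-style code), not GenLIP's exact implementation.

```python
import numpy as np

def sincos_1d(dim, positions):
    """Standard 1D sine-cosine embedding for a vector of positions."""
    omega = 1.0 / (10000 ** (np.arange(dim // 2) / (dim / 2.0)))
    angles = np.outer(positions, omega)                      # (N, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def sincos_2d(dim, grid_h, grid_w):
    """2D sine-cosine positional embeddings for a (grid_h x grid_w) patch grid."""
    ys, xs = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    emb_h = sincos_1d(dim // 2, ys.reshape(-1))              # half the channels encode rows
    emb_w = sincos_1d(dim // 2, xs.reshape(-1))              # half encode columns
    return np.concatenate([emb_h, emb_w], axis=1)            # (grid_h*grid_w, dim)

# e.g. a 224x224 image with 16x16 patches -> 14x14 grid -> (196, 768) embedding table
pos = sincos_2d(dim=768, grid_h=14, grid_w=14)
```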

Unified Token Space

Visual patches are linearly projected to the word embedding dimension; text tokens go through the word embedding layer. Both share positional encoding and Transformer layers.

Training Strategy

  1. Basic pre-training: Next-token prediction on 8B image-text pairs
  2. Multi-resolution fine-tuning: Continual training with images of different resolutions and native aspect ratios
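
The two stages can be summarized as a simple schedule like the one below; apart from the 8B-sample figure, the field values are illustrative placeholders rather than the paper's actual settings.

```python
# Illustrative two-stage schedule; only the 8B-sample figure comes from the article.
training_stages = [
    {
        "name": "basic_pretraining",
        "objective": "next_token_prediction",
        "samples": 8_000_000_000,
        "inputs": "fixed-resolution image-text pairs (assumed)",
    },
    {
        "name": "multi_resolution_continual_pretraining",
        "objective": "next_token_prediction",
        "native_aspect_ratio": True,
        "inputs": "mixed, patch-aligned resolutions (see the helper in Section 04)",
    },
]

for stage in training_stages:
    print(stage["name"], "->", stage["objective"])
```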

Section 07

Implications and Limitations for MLLM Architecture

Implications

  • Minimalism: Simpler design can improve performance
  • Unified pre-training objective: A single autoregressive objective trains strong visual encoders
  • Dissolving modality boundaries: Visual and language tokens are modeled uniformly

Limitations

  1. Generation efficiency: Autoregressive generation is slower than contrastive similarity scoring
  2. Long text processing: Long sequences remain an efficiency bottleneck
  3. Negative sample learning: No explicit negative-sample supervision

Section 08

Future Directions and Conclusion

Future Directions

  • Hybrid objectives: Combine contrastive and generative objectives
  • Multimodal expansion: Video, audio, and other modalities
  • Efficient inference: Optimize inference for generative visual encoders

Conclusion

GenLIP turns the ViT from a feature extractor into a generative model, bridging the vision-language gap and providing a concise foundation for next-generation MLLMs. It demonstrates the value of returning to the essence: predicting the next token.