
GenLIP: Teaching ViT to Speak—Generative Language-Image Pre-training for Multimodal Large Models

GenLIP is a minimalist generative pre-training framework that trains a Vision Transformer (ViT) to predict language tokens directly from visual tokens via a standard language modeling objective, without contrastive learning or an additional text decoder. Trained on only 8B samples, it matches strong baselines and excels on detail-sensitive tasks such as OCR and chart understanding.

Multimodal Large Models · Vision Transformer · Generative Pre-training · Visual Encoder · Autoregressive Model · CLIP · MLLM · Image Understanding
Published 2026-05-02 01:51 · Recent activity 2026-05-04 10:53 · Estimated read 8 min

Section 01

GenLIP: Guide to the Minimalist Generative Vision-Language Pre-training Framework

GenLIP is a minimalist generative pre-training framework for multimodal large models. Its core idea is to train a Vision Transformer (ViT) to predict language tokens directly from visual tokens using a standard language modeling objective, without contrastive learning or an additional text decoder. Trained on only 8B samples, it matches strong baselines and excels on detail-sensitive tasks such as OCR and chart understanding.


Section 02

Paradigm Dilemma in Multimodal Pre-training

Multimodal Large Language Models (MLLMs) rely on visual encoder pre-training, which faces a dilemma: traditional contrastive learning (e.g., CLIP) requires carefully constructed batches, hard negative handling, and an independent text encoder, while existing generative methods have complex structures that need additional text decoders or special training objectives. GenLIP breaks this impasse with a minimalist generative pre-training framework.


Section 03

Core Innovations of GenLIP

Direct Language Token Prediction

  1. Split images into visual tokens (patch embedding)
  2. Feed visual tokens into a standard Transformer
  3. Objective: predict the next language token of the corresponding text

No contrastive-learning batch construction and no additional text decoder: a single unified Transformer models both visual and language tokens.
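
The three steps above amount to a single decoder-only language model whose input sequence starts with projected image patches. Below is a minimal, illustrative PyTorch sketch of that objective; the class and module names (GenerativeVisionLM, patch_embed, etc.) are assumptions, positional embeddings are omitted, and the purely causal mask over image patches is the simplest possible choice rather than the paper's confirmed design.

```python
import torch
import torch.nn.functional as F
from torch import nn


class GenerativeVisionLM(nn.Module):
    """Minimal sketch of a GenLIP-style model: one Transformer over visual + text tokens."""

    def __init__(self, patch_dim, d_model, vocab_size, n_layers=12, n_heads=12):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, d_model)      # flattened patches -> visual tokens
        self.word_embed = nn.Embedding(vocab_size, d_model)   # text tokens
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patches, text_ids):
        # patches: (B, N_img, patch_dim) flattened image patches; text_ids: (B, N_txt)
        vis = self.patch_embed(patches)
        txt = self.word_embed(text_ids)
        x = torch.cat([vis, txt], dim=1)                      # one shared token sequence
        # Simple causal mask; the real model may attend bidirectionally over image patches.
        seq_len = x.size(1)
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf"), device=x.device), 1)
        h = self.transformer(x, mask=mask)
        # Each position predicts the next token; the loss covers text positions only.
        logits = self.lm_head(h[:, vis.size(1) - 1 : -1])     # predictions for every text token
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)), text_ids.reshape(-1))
```

A forward pass returns the language modeling loss directly; no text encoder, similarity matrix, or contrastive batch logic is involved.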

Alignment with the Autoregressive Essence of LLMs

  • Seamless integration: Pre-trained ViT can be directly connected to autoregressive LLMs without adaptation layers
  • Consistent behavior: Shares the "predict next token" inductive bias
  • Simplified architecture: Single Transformer reduces complexity
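
Because GenLIP's visual tokens already live in the language model's embedding space, the "seamless integration" above can be as direct as prepending them to the prompt embeddings. The sketch below assumes a Hugging Face-style causal LM interface; `vit`, `llm`, and `tokenizer` are hypothetical placeholders, not a published API.

```python
import torch

def caption_with_genlip(vit, llm, tokenizer, pixel_values, prompt="Describe the image."):
    """Prepend GenLIP visual tokens to an autoregressive LLM's input embeddings.

    `vit`, `llm`, and `tokenizer` are hypothetical placeholders: a pre-trained
    GenLIP encoder and any Hugging Face-style causal LM. Note that no adapter
    or projection MLP sits between the two models.
    """
    image_tokens = vit(pixel_values)                               # (B, N_img, d_model)
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids  # (1, N_txt)
    prompt_emb = llm.get_input_embeddings()(prompt_ids)            # (1, N_txt, d_model)
    inputs_embeds = torch.cat(
        [image_tokens, prompt_emb.expand(image_tokens.size(0), -1, -1)], dim=1
    )
    return llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=64)
```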

Section 04

Training Efficiency and Multi-Resolution Optimization

Data Efficiency with 8B Samples

GenLIP is trained on 8 billion samples, less data than most leading models use, yet it matches strong baselines. The reasons include the high information density of the generative objective, the absence of a contrastive batch bottleneck, and direct optimization of the downstream output distribution.

Multi-Resolution Continual Pre-training

After basic pre-training, continual training with native aspect ratio images at multiple resolutions significantly improves performance on detail-sensitive tasks:

  • OCR: Accurately recognize text in images
  • Chart understanding: Parse complex charts
  • Fine-grained visual understanding: Capture tiny details
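
One way to prepare native-aspect-ratio inputs at varying resolutions is to snap each image to the nearest patch-aligned size under a token budget. The helper below is an illustrative sketch of that idea, not GenLIP's actual preprocessing code; the token budget and patch size are assumed values.

```python
import math

def native_resolution_size(width, height, patch=16, max_tokens=1024):
    """Pick a patch-aligned size that keeps the native aspect ratio.

    Illustrative sketch (not GenLIP's actual preprocessing): scale the image
    down, if necessary, so the patch grid stays within `max_tokens`, then snap
    each side to a multiple of the patch size.
    """
    scale = min(1.0, math.sqrt(max_tokens * patch * patch / (width * height)))
    new_w = max(patch, math.floor(width * scale / patch) * patch)
    new_h = max(patch, math.floor(height * scale / patch) * patch)
    return new_w, new_h

# Example: a 1280x960 chart with 16x16 patches and a 1024-token budget
# resizes to 576x432, i.e. a 36x27 grid of 972 visual tokens.
print(native_resolution_size(1280, 960))   # (576, 432)
```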

Section 05

Performance Evaluation and Comparison with CLIP Paradigm

Multimodal Benchmark Testing

GenLIP matches or surpasses strong baselines trained on more data on tasks such as visual question answering and image captioning. The multi-resolution version has a clear advantage on fine-grained tasks, and performance improves predictably with scale.

Comparison with CLIP

| Dimension          | CLIP-style Contrastive Learning                | GenLIP Generative Pre-training   |
|--------------------|------------------------------------------------|----------------------------------|
| Architecture       | Dual encoders (visual + text)                  | Single Transformer               |
| Training Objective | Contrastive loss                               | Language modeling loss           |
| Batch Construction | Carefully constructed positive/negative pairs  | No special batch construction    |
| Text Encoder       | Trained independently                          | Shares the same Transformer      |
| Alignment with LLM | Needs additional adaptation                    | Natively aligned (autoregressive)|
| Data Efficiency    | Usually requires large amounts of data         | Competitive with only 8B samples |
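
To make the "Training Objective" and "Batch Construction" rows concrete, here is an illustrative side-by-side sketch of the two loss styles; it is simplified (e.g., CLIP's learnable temperature and GenLIP's exact loss details are omitted).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE: every other sample in the batch acts as a
    negative, so the batch itself must be constructed carefully."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def generative_lm_loss(text_logits, text_ids):
    """GenLIP-style objective: plain per-sample next-token prediction, no negatives."""
    return F.cross_entropy(text_logits[:, :-1].reshape(-1, text_logits.size(-1)),
                           text_ids[:, 1:].reshape(-1))
```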

Section 06

Technical Implementation Details of GenLIP

Visual Tokenization

  • Patch size: 14×14 or 16×16 pixels
  • Position encoding: 2D sine-cosine or learned positional embeddings
  • Special token: [CLS] for global image representation
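
For the "2D sine-cosine" option, the standard recipe splits the embedding channels between row and column coordinates. The sketch below is that generic recipe (as popularized by MAE-style code), not GenLIP's exact implementation.

```python
import numpy as np

def sincos_1d(dim, positions):
    """Standard 1D sine-cosine embedding for a vector of positions."""
    omega = 1.0 / (10000 ** (np.arange(dim // 2) / (dim / 2.0)))
    angles = np.outer(positions, omega)                      # (N, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def sincos_2d(dim, grid_h, grid_w):
    """2D sine-cosine positional embeddings for a (grid_h x grid_w) patch grid."""
    ys, xs = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    emb_h = sincos_1d(dim // 2, ys.reshape(-1))              # half the channels encode rows
    emb_w = sincos_1d(dim // 2, xs.reshape(-1))              # half encode columns
    return np.concatenate([emb_h, emb_w], axis=1)            # (grid_h*grid_w, dim)

# e.g. a 224x224 image with 16x16 patches -> 14x14 grid -> (196, 768) embedding table
pos = sincos_2d(dim=768, grid_h=14, grid_w=14)
```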

Unified Token Space

Visual patches are linearly projected to the word embedding dimension; text tokens go through the word embedding layer. Both share positional encoding and Transformer layers.

Training Strategy

  1. Basic pre-training: Next-token prediction on 8B image-text pairs
  2. Multi-resolution fine-tuning: Continual training with images of different resolutions and native aspect ratios
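
The two stages can be summarized as a simple schedule like the one below; apart from the 8B-sample figure, the field values are illustrative placeholders rather than the paper's actual settings.

```python
# Illustrative two-stage schedule; only the 8B-sample figure comes from the article.
training_stages = [
    {
        "name": "basic_pretraining",
        "objective": "next_token_prediction",
        "samples": 8_000_000_000,
        "inputs": "fixed-resolution image-text pairs (assumed)",
    },
    {
        "name": "multi_resolution_continual_pretraining",
        "objective": "next_token_prediction",
        "native_aspect_ratio": True,
        "inputs": "mixed, patch-aligned resolutions (see the helper in Section 04)",
    },
]

for stage in training_stages:
    print(stage["name"], "->", stage["objective"])
```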

Section 07

Implications and Limitations for MLLM Architecture

Implications

  • Minimalism: Simpler design can improve performance
  • Unified pre-training objective: A single autoregressive objective trains strong visual encoders
  • Dissolving modality boundaries: Visual and language tokens are modeled uniformly

Limitations

  1. Generation efficiency: Autoregressive generation is slower than contrastive similarity scoring
  2. Long text processing: Long sequences remain an efficiency bottleneck
  3. Negative sample learning: No explicit negative-sample supervision

Section 08

Future Directions and Conclusion

Future Directions

  • Hybrid objectives: Combine contrastive and generative objectives
  • Multimodal expansion: Video, audio, and other modalities
  • Efficient inference: Optimize inference for generative visual encoders

Conclusion

GenLIP turns the ViT from a feature extractor into a generative model, bridging the vision-language gap and providing a concise foundation for next-generation MLLMs. It demonstrates the value of returning to the essence: predicting the next token.