Section 01
GenLIP: Guide to the Minimalist Generative Vision-Language Pre-training Framework
GenLIP is a minimalist generative pre-training framework for multimodal large models. Its core idea is to train a Vision Transformer (ViT) to directly predict language tokens from visual tokens using a standard language-modeling objective, without contrastive learning or an additional text decoder. Trained on only 8B samples, it matches strong baselines and excels at detail-sensitive tasks such as OCR and chart understanding.
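The core idea above can be sketched in code: a single ViT-style transformer consumes image patch tokens as a fully visible prefix and predicts the caption tokens autoregressively with a cross-entropy language-modeling loss, with no separate text decoder. This is a minimal PyTorch sketch under that prefix-LM assumption; the class name, dimensions, and masking details are illustrative, not GenLIP's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GenLIPSketch(nn.Module):
    """Illustrative sketch: one transformer handles both visual tokens
    (bidirectional prefix) and text tokens (causal), trained with a
    standard language-modeling loss. Not the official GenLIP code."""

    def __init__(self, vocab_size=1000, dim=128, patch=16, img=64,
                 depth=2, heads=4, max_text=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.n_patches = (img // patch) ** 2
        self.tok_embed = nn.Embedding(vocab_size, dim)
        self.pos = nn.Parameter(torch.zeros(1, self.n_patches + max_text, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, vocab_size)  # predicts language tokens

    def forward(self, images, text):
        # Visual tokens: non-overlapping patches -> (B, n_patches, dim)
        v = self.patch_embed(images).flatten(2).transpose(1, 2)
        # Text inputs are shifted right; targets are the original tokens
        t = self.tok_embed(text[:, :-1])
        x = torch.cat([v, t], dim=1) + self.pos[:, : v.size(1) + t.size(1)]

        # Prefix-LM attention mask (True = blocked):
        # image tokens see each other but not text; text tokens see
        # all image tokens and earlier text tokens only.
        L = x.size(1)
        n = self.n_patches
        mask = torch.zeros(L, L, dtype=torch.bool, device=x.device)
        mask[:n, n:] = True  # image must not peek at text targets
        mask[n:, n:] = torch.triu(torch.ones(L - n, L - n, dtype=torch.bool,
                                             device=x.device), diagonal=1)

        h = self.encoder(x, mask=mask)
        logits = self.head(h[:, n:])  # logits only over text positions
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               text[:, 1:].reshape(-1))
```

A forward pass on a batch of images and tokenized captions returns the scalar LM loss directly, so the whole objective is a single cross-entropy over predicted caption tokens.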