Section 01
[Main Floor] Unified Pixel and Token Generative Model: Breaking the Bottleneck in Multimodal Fine-Grained Visual Understanding
This article introduces a new multimodal model architecture that unifies image pixel-level tokens and text tokens into a generative language model. Through techniques like independent pixel embedding, color folding, global conditional attention approximation, and unsupervised image pre-training, it solves the problem of fine-grained visual information loss in traditional models and significantly improves the ability to recognize small text and numbers. Experiments show that this architecture has excellent data and parameter efficiency, follows scaling laws, and has broad application prospects.