Section 01
Tuna-2: Abandoning Visual Encoders, Pixel Embeddings Chart a New Direction for Multimodal Models
Tuna-2 proposes a natively unified multimodal model that dispenses with pre-trained visual encoders entirely. Instead, it performs visual understanding and generation directly from raw pixels through a simple pixel-embedding layer, and reports state-of-the-art results on multiple benchmarks, suggesting that end-to-end pixel-space learning is a scalable path toward stronger visual representations.
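To make the core idea concrete, here is a minimal sketch of what a pixel-embedding layer looks like in PyTorch. All names and hyperparameters below are illustrative assumptions, not Tuna-2's actual implementation: the point is only that raw pixel patches are projected straight into the token space of the transformer, with no pre-trained visual encoder in between.

```python
import torch
import torch.nn as nn

class PixelEmbedding(nn.Module):
    """Illustrative pixel-embedding layer (hypothetical, not Tuna-2's code):
    raw RGB patches are linearly projected into transformer token embeddings,
    trained end to end with the rest of the model."""
    def __init__(self, patch_size: int = 16, in_channels: int = 3,
                 embed_dim: int = 1024):
        super().__init__()
        # A strided convolution is equivalent to flattening each
        # patch_size x patch_size patch and applying a single linear layer.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, pixels: torch.Tensor) -> torch.Tensor:
        # pixels: (B, 3, H, W) raw image tensor
        x = self.proj(pixels)                # (B, D, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, D) visual tokens

# The resulting visual tokens would be concatenated with text tokens
# and fed into the shared multimodal transformer.
tokens = PixelEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 1024])
```

Because this layer is just one learned projection, all visual representation learning happens inside the multimodal transformer itself, which is what distinguishes this design from pipelines built around a frozen, pre-trained visual encoder.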