Section 01
[Introduction] NEO-ov: End-to-End Breakthrough of a Native Unified Vision Model
This article introduces NEO-ov, a native vision-language model whose core is to learn cross-frame pixel-to-word correspondences end-to-end without external encoders or adapters. Experiments verify the advantages of this native architecture in fine-grained visual perception and the feasibility of a single vision architecture for large-scale applications.