Section 01
LongCat-Next: Introduction to the Native Multimodal Autoregressive Framework
Meituan's open-source LongCat-Next is a native autoregressive multimodal framework that unifies text, visual, and audio information through discretization. Built on the DiNA framework, it represents all modalities as discrete tokens in a single shared sequence, and its dNaViT tokenizer enables visual tokenization at arbitrary resolutions. Under one autoregressive objective, the model unifies seeing (visual understanding), drawing (image generation), and speaking (voice interaction). This design addresses the fragmentation and weak cross-modal fusion common in traditional multimodal architectures, and the framework has been open-sourced to promote community development.
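To make the core idea concrete, here is a minimal sketch of what "discrete tokens from every modality in one sequence, trained with a single next-token objective" means. Everything below is illustrative: the vocabulary ranges, the modality marker ids, and the helper names are invented for this example and are not LongCat-Next's actual implementation.

```python
import math

# Hypothetical unified vocabulary: disjoint id ranges per modality, so
# text, visual, and audio tokens can live in one sequence. These sizes
# and ranges are made up for illustration.
TEXT_RANGE = range(0, 1000)        # text tokens
IMAGE_RANGE = range(1000, 9000)    # discrete visual codes (e.g. from a visual tokenizer)
AUDIO_RANGE = range(9000, 13000)   # discrete audio codes
BOT, BOI, BOA = 13000, 13001, 13002  # invented modality start markers
VOCAB_SIZE = 13003                   # unified vocabulary including markers


def interleave(text_ids, image_ids, audio_ids):
    """Concatenate modality streams into one token sequence,
    separated by modality start markers."""
    return [BOT, *text_ids, BOI, *image_ids, BOA, *audio_ids]


def next_token_nll(logits, targets):
    """The single autoregressive objective: average negative
    log-likelihood of each target token, regardless of which
    modality it belongs to."""
    total = 0.0
    for step_logits, target in zip(logits, targets):
        # log-sum-exp for a numerically stable log-softmax
        m = max(step_logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in step_logits))
        total += log_z - step_logits[target]
    return total / len(targets)
```

A usage example: with uniform (all-zero) logits the loss reduces to `log(VOCAB_SIZE)`, which shows that a text token, a visual code, and an audio code are all scored by the same objective.

```python
seq = interleave([1, 2], [1000, 1001], [9000])       # 3 markers + 5 tokens
targets = seq[1:]
logits = [[0.0] * VOCAB_SIZE for _ in targets]       # stand-in for model output
loss = next_token_nll(logits, targets)               # equals log(VOCAB_SIZE)
```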