Zing Forum

LongCat-Next: A Native Autoregressive Framework for Unified Discretization of Multimodal Information

Meituan's open-source LongCat-Next unifies text, visual, and audio information into discrete tokens via the DiNA framework, uses the innovative dNaViT to enable arbitrary-resolution visual tokenization, and achieves unified multimodal capabilities of seeing, drawing, and speaking under a single autoregressive objective.

Tags: LongCat-Next · DiNA · multimodal models · discrete tokens · Vision Transformer · autoregressive models · Meituan · open source · native multimodality
Published 2026-03-29 14:35 · Recent activity 2026-03-31 10:52 · Estimated read 7 min

Section 01

LongCat-Next: Introduction to the Native Multimodal Autoregressive Framework

Meituan's open-source LongCat-Next is a native autoregressive multimodal framework that unifies the discretization of text, visual, and audio information. It uses the DiNA framework to uniformly represent multimodal information as discrete tokens, employs the novel dNaViT to enable arbitrary-resolution visual tokenization, and achieves unified capabilities of seeing (visual understanding), drawing (image generation), and speaking (voice interaction) under a single autoregressive objective. This design addresses the fragmentation and poor modal fusion of traditional multimodal architectures, and the model has been open-sourced to foster community development.

Section 02

Dilemmas of Current Multimodal Architectures

The Next-Token Prediction (NTP) paradigm has driven the success of large language models, but contemporary multimodal systems remain language-centric, treating non-linguistic modalities as external attachments. This leads to two major issues: architectural fragmentation (each modality requires its own encoder/decoder) and poor inter-modal integration. Most existing models adopt a plug-in architecture of "visual encoder + projection layer + language model", in which visual information is compressed into continuous vectors and mapped into the language embedding space, limiting the capture of fine detail and increasing training and inference complexity.
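The plug-in hand-off criticized above can be sketched in a few lines. All dimensions and the `vision_encoder`/`projection_layer` names are hypothetical, chosen only to illustrate how continuous visual vectors are mapped into the language embedding space rather than becoming discrete tokens:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration only.
NUM_PATCHES, VISION_DIM, LM_DIM = 256, 1024, 4096

def vision_encoder(image):
    """Stand-in for a ViT-style encoder: one continuous vector per patch."""
    return rng.standard_normal((NUM_PATCHES, VISION_DIM))

def projection_layer(features, w):
    """Linear map from the vision feature space into the LM embedding space."""
    return features @ w

w_proj = rng.standard_normal((VISION_DIM, LM_DIM)) * 0.01
visual_embeddings = projection_layer(vision_encoder(None), w_proj)

# The LM consumes these continuous vectors alongside text embeddings;
# the visual side is never discretized in this design.
print(visual_embeddings.shape)  # (256, 4096)
```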

Section 03

DiNA: Core of the Discrete Native Autoregressive Framework

The core of the DiNA (Discrete Native Autoregressive) framework is to uniformly represent multimodal information in a shared discrete space, enabling consistent cross-modal autoregressive modeling. Its advantages include: architectural simplification (a single Transformer handles all modalities), deep fusion (token-level interaction), and a unified optimization objective (simplifies training and learns consistent cross-modal representations).
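One way such a shared discrete space could be realized is to give each modality a disjoint slice of a single vocabulary. The vocabulary sizes and the offset scheme below are assumptions for illustration, not details from the report:

```python
# Each modality's tokenizer emits ids in its own local range; fixed offsets
# place those ranges in disjoint slices of one shared vocabulary.
TEXT_VOCAB, VISUAL_VOCAB, AUDIO_VOCAB = 32000, 8192, 4096

OFFSETS = {
    "text": 0,
    "visual": TEXT_VOCAB,
    "audio": TEXT_VOCAB + VISUAL_VOCAB,
}
UNIFIED_VOCAB = TEXT_VOCAB + VISUAL_VOCAB + AUDIO_VOCAB

def to_unified(modality, local_ids):
    """Map a modality-local token id list into the shared id space."""
    return [OFFSETS[modality] + i for i in local_ids]

# A mixed sequence that a single Transformer can model autoregressively.
sequence = (
    to_unified("text", [17, 512])
    + to_unified("visual", [3, 99])
    + to_unified("audio", [41])
)
print(sequence)  # [17, 512, 32003, 32099, 40233]
```

With all modalities living in one id space, token-level cross-modal interaction and a single optimization objective follow naturally.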

Section 04

dNaViT: Detailed Explanation of the Arbitrary-Resolution Visual Transformer

dNaViT (Discrete Native Arbitrary-Resolution Visual Transformer) is a core component of the DiNA framework, responsible for converting continuous visual signals into hierarchical discrete tokens while supporting arbitrary-resolution processing. It adopts a hierarchical tokenization strategy: the image is first encoded into a multi-scale feature pyramid, and each scale is then vector-quantized (low scales capture global semantics; high scales retain local detail); during decoding, the model progressively upsamples and fuses scales to reconstruct high-quality outputs. dNaViT also dynamically adjusts the token grid size to process images of different sizes efficiently.
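The per-scale quantization step might look like the following sketch. The pyramid shapes, codebook sizes, and random features are hypothetical stand-ins for real encoder outputs; only the nearest-codebook lookup is the standard vector-quantization operation:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(features, codebook):
    """Nearest-neighbour vector quantization: map each feature vector to the
    index of its closest codebook entry (squared Euclidean distance)."""
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

# Hypothetical pyramid: a coarse 4x4 grid for global semantics and a fine
# 16x16 grid for local detail, with a separate codebook per scale.
dim = 64
pyramid = {"coarse": rng.standard_normal((4 * 4, dim)),
           "fine": rng.standard_normal((16 * 16, dim))}
codebooks = {"coarse": rng.standard_normal((512, dim)),
             "fine": rng.standard_normal((4096, dim))}

tokens = {scale: quantize(feats, codebooks[scale])
          for scale, feats in pyramid.items()}
print(len(tokens["coarse"]), len(tokens["fine"]))  # 16 256
```

Arbitrary resolution then amounts to letting the grid sizes (here 4x4 and 16x16) vary with the input image instead of being fixed.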

Section 05

LongCat-Next Model Architecture and Training Strategy

Based on the DiNA framework and dNaViT, LongCat-Next is a minimalist native multimodal model: its main body is a large-scale Transformer that receives mixed sequences of text/visual/audio tokens and autoregressively predicts the next token. The training uses a multi-stage strategy: unimodal pre-training (learning discrete representations of text/visual/audio separately), multimodal alignment training (learning cross-modal correlations using paired data), and instruction fine-tuning (completing tasks following human instructions).
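The single objective, next-token prediction over a mixed token sequence, can be sketched as a plain cross-entropy loss. The vocabulary size, toy sequence, and random logits below are illustrative assumptions, not values from the model:

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 44288  # hypothetical unified vocabulary covering all modalities

def next_token_loss(logits, sequence):
    """Average cross-entropy of predicting token t+1 from the prefix up to t.
    logits has one row per prediction position (len(sequence) - 1 rows)."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    targets = sequence[1:]
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy mixed sequence of text/visual/audio ids drawn from the shared space.
seq = np.array([17, 32001, 32002, 40200, 5])
logits = rng.standard_normal((len(seq) - 1, VOCAB))  # stand-in model outputs
loss = next_token_loss(logits, seq)
```

The same loss applies regardless of which modality each position belongs to, which is what lets one Transformer and one objective cover all three training stages.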

Section 06

LongCat-Next Performance Breakthroughs and Evaluation Results

LongCat-Next performs strongly on multimodal benchmarks: in visual understanding, it is the first discrete-token model to match the performance of continuous-representation models; in image generation, it reconciles the tension between understanding and generation; in audio processing, it enables end-to-end voice interaction (generating text from audio tokens, or speech from text).

Section 07

Open-Source Contributions and Future Outlook

Meituan has open-sourced LongCat-Next and its tokenizer, including model weights and inference code, dNaViT training code and pretrained weights, data-processing pipelines and training scripts, and model cards and technical reports, providing a baseline for the community, promoting the adoption of discrete representations, and lowering research barriers. Technically, it marks a paradigm shift in multimodality from "language-dominant" to "modality equality", validating hypotheses such as that discrete representations can carry complex information and that autoregression extends naturally to multiple modalities. Future directions include larger-scale data, more efficient tokenization algorithms, and bringing more modalities (video, 3D) into the unified framework.