Section 01
[Introduction] STARFlow2: Achieving Truly Unified Multimodal Generation with Autoregressive Normalizing Flows
STARFlow2 addresses an architectural dilemma in current multimodal generation and proposes a unified solution based on autoregressive normalizing flows. Its core innovation is to exploit the structural isomorphism between autoregressive normalizing flows and Transformers: both rely on causal masking and can share KV caches. Building on this, the Pretzel vertically interleaved architecture fuses the VLM and TarFlow flows, and a unified FAE latent space supports both generation and understanding of text and images. The design also supports cache-friendly interleaved generation, which resolves the structural mismatch of concatenated architectures.
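
To make the causal-mask and cache analogy concrete, the following is a minimal NumPy sketch of a generic autoregressive affine flow. It is not STARFlow2's implementation: the mean-pooling conditioner, the weight names `W_mu` and `W_alpha`, and the running-sum `cache` are illustrative assumptions standing in for a causal Transformer and its KV cache. The sketch only shows that the forward (encoding) pass can run in parallel under a causal constraint, while the inverse (sampling) pass proceeds token by token and reuses cached prefix state.

```python
import numpy as np


def forward(x, W_mu, W_alpha):
    """Encode x -> z in parallel. Position t is conditioned only on x[:t]."""
    T, D = x.shape
    # Exclusive prefix sum plays the role of a causal mask:
    # row t contains the sum of x[0..t-1], and row 0 is all zeros.
    prefix = np.vstack([np.zeros((1, D)), np.cumsum(x, axis=0)[:-1]])
    counts = np.maximum(np.arange(T), 1)[:, None]
    h = prefix / counts                      # causal summary of the prefix
    mu = h @ W_mu                            # shift
    alpha = np.tanh(h @ W_alpha)             # bounded log-scale
    z = (x - mu) * np.exp(-alpha)
    # The Jacobian is triangular, so log|det| is the sum of the diagonal terms.
    log_det = -alpha.sum()
    return z, log_det


def inverse(z, W_mu, W_alpha):
    """Decode z -> x sequentially, reusing a running cache of the prefix."""
    T, D = z.shape
    x = np.zeros_like(z)
    cache = np.zeros(D)                      # hypothetical stand-in for a KV cache
    for t in range(T):
        h = cache / max(t, 1)
        mu = h @ W_mu
        alpha = np.tanh(h @ W_alpha)
        x[t] = z[t] * np.exp(alpha) + mu
        cache += x[t]                        # append the new token's contribution
    return x


# Small demonstration with arbitrary (assumed) sizes and random weights.
rng = np.random.default_rng(0)
T, D = 6, 4
W_mu = rng.normal(scale=0.1, size=(D, D))
W_alpha = rng.normal(scale=0.1, size=(D, D))
x = rng.normal(size=(T, D))

z, log_det = forward(x, W_mu, W_alpha)
x_rec = inverse(z, W_mu, W_alpha)
```

In this toy setup, `inverse(forward(x))` recovers `x` up to floating-point error, and changing a later token leaves the latents of earlier positions untouched. That causal property is what allows a flow of this kind to share masking and caching machinery with an autoregressive Transformer.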