Zing Forum

JoyAI-Image: JD Open-Source Unified Multimodal Foundation Model Enabling Closed-Loop Collaboration of Image Understanding, Generation, and Editing

JoyAI-Image is a 24B-parameter unified multimodal foundation model open-sourced by JD. It achieves deep integration of three core capabilities (image understanding, text-to-image generation, and instruction-guided image editing) through a collaborative architecture that combines an 8B multimodal large language model (MLLM) with a 16B multimodal diffusion Transformer (MMDiT).

Tags: multimodal model, image generation, image editing, diffusion model, spatial understanding, long text rendering, JD open source, Apache-2.0
Published 2026-04-02 23:43 · Recent activity 2026-04-02 23:50 · Estimated read: 6 min

Section 01

[Introduction] JD Open-Sources JoyAI-Image: Unified Multimodal Model Enables Closed-Loop Collaboration of Image Understanding, Generation, and Editing

JoyAI-Image, open-sourced by JD, is a 24B-parameter unified multimodal foundation model. It deeply integrates three core capabilities (image understanding, text-to-image generation, and instruction-guided image editing) via a collaborative architecture combining an 8B multimodal large language model (MLLM) and a 16B multimodal diffusion Transformer (MMDiT), forming an "Understand-Generate-Edit" closed loop. The model offers strong spatial understanding, long-text rendering, and controllable spatial editing, and is released under the Apache-2.0 license.


Section 02

Project Background and Core Design Philosophy

JoyAI-Image is a comprehensive AI system open-sourced by JD. Its core design philosophy is the "Understand-Generate-Edit" closed-loop collaboration: stronger spatial understanding enhances scene generation and controllable editing effects, while generative transformations (e.g., perspective changes) provide supplementary evidence for spatial reasoning. The model combines the 8B MLLM and 16B MMDiT to achieve unified processing of the three tasks.


Section 03

Technical Architecture and Core Innovations

JoyAI-Image adopts an MLLM-MMDiT shared-interface design to enable collaboration and knowledge sharing across the understanding, generation, and editing tasks. For spatial intelligence, it couples understanding and generation through a bidirectional loop mechanism, giving it stronger spatial understanding, controllable spatial editing, and perspective-assisted reasoning: it can comprehend spatial relationships in images and generate or edit logically consistent content.
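The post does not detail how the shared interface passes information between the two models. A common design in MLLM-conditioned diffusion Transformers is for the MLLM's token embeddings to condition the denoiser through cross-attention, with image latents as queries and MLLM tokens as keys/values. A minimal NumPy sketch of that pattern (all shapes, names, and the single-head simplification are illustrative assumptions, not JoyAI-Image's actual interface):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(image_latents, mllm_tokens):
    """Single-head cross-attention: image latents (queries) attend to
    MLLM condition tokens (keys/values). Projections omitted for brevity."""
    d_k = mllm_tokens.shape[-1]
    scores = image_latents @ mllm_tokens.T / np.sqrt(d_k)  # (patches, tokens)
    weights = softmax(scores, axis=-1)                     # rows sum to 1
    return weights @ mllm_tokens                           # (patches, dim)

rng = np.random.default_rng(0)
latents = rng.normal(size=(16, 64))  # 16 image patches, model dim 64
cond = rng.normal(size=(8, 64))      # 8 MLLM condition tokens
out = cross_attend(latents, cond)
print(out.shape)  # (16, 64)
```

In a full MMDiT block this would be one sublayer among self-attention and MLP sublayers, repeated per denoising step; the sketch only shows where the MLLM's output enters the diffusion model.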


Section 04

Demonstration of Long Text Rendering and Spatial Editing Capabilities

Long Text Rendering: Optimized to handle complex text scenarios (multi-panel comics, multi-line text, multilingual typesetting, etc.), maintaining layout fidelity and typesetting effects. Suitable for e-commerce product images, graphic creation, and other scenarios.

Spatial Editing: Supports three modes: object movement (automatic occlusion/lighting handling), object rotation (multi-view standard rotation), and camera control (adjusting yaw/pitch/zoom), ensuring scene consistency and instruction compliance.
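The three editing modes above each take different controls. One way to picture the parameter surface is as small schemas, one per mode; the field names below are hypothetical illustrations of the controls the post describes (offsets, rotation angle, yaw/pitch/zoom), not JoyAI-Image's actual API:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class ObjectMove:
    target: str                  # natural-language reference to the object
    offset_px: Tuple[int, int]   # (dx, dy) translation; occlusion/lighting handled by the model

@dataclass
class ObjectRotate:
    target: str
    azimuth_deg: float           # multi-view standard rotation angle

@dataclass
class CameraControl:
    yaw_deg: float = 0.0
    pitch_deg: float = 0.0
    zoom: float = 1.0            # >1 zooms in, <1 zooms out

edit = CameraControl(yaw_deg=30.0, zoom=1.2)
print(edit)
```

Keeping the modes as separate structured requests (rather than one free-form string) is one way a front end could guarantee the "instruction compliance" the post claims, since each control maps to a well-defined transformation.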


Section 05

Training Data and Optimization Strategy

The model uses an extensible data pipeline covering spatial understanding data (OpenSpatial), long text rendering data, editing data, etc. It is paired with a multi-stage optimization strategy to ensure balanced performance across all tasks. Spatial data enhances spatial relationship understanding, long text data strengthens text scene processing, and editing data facilitates precise instruction-based modifications.
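The post names the data sources and a multi-stage strategy but not the actual recipe. As a sketch of what "balanced performance across all tasks" might look like operationally, here is a hypothetical stage schedule that shifts the mixing ratios across stages; the stage names and ratios are assumptions for illustration, not JoyAI-Image's published configuration:

```python
# Hypothetical multi-stage data-mixing schedule. Each stage reweights the
# data sources named in the post (OpenSpatial spatial data, long-text
# rendering data, editing data) so no single task dominates training.
stages = [
    {"stage": "pretrain",  "mix": {"text_to_image": 0.7, "spatial_openspatial": 0.2, "long_text": 0.1}},
    {"stage": "mid_train", "mix": {"text_to_image": 0.4, "spatial_openspatial": 0.3, "long_text": 0.2, "editing": 0.1}},
    {"stage": "finetune",  "mix": {"editing": 0.5, "long_text": 0.3, "spatial_openspatial": 0.2}},
]

for s in stages:
    # Sanity check: every stage's sampling ratios form a valid distribution.
    assert abs(sum(s["mix"].values()) - 1.0) < 1e-9
print([s["stage"] for s in stages])
```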


Section 06

Practical Applications and Inference Support

Complete inference code and parameter descriptions are provided, supporting three main tasks:

  • Image understanding: multi-image input, enabling image comparison and description;
  • Generation/editing: controlled by natural-language instructions, with parameters such as output size and random seed;
  • Prompt rewriting: uses an LLM to optimize input prompts and improve generation quality.

All three tasks can be invoked via the command-line interface.
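The command-line surface described above can be sketched as an argument parser covering the three tasks and the parameters the post mentions (multi-image input, output size, random seed, prompt rewriting). All flag names here are hypothetical, since the post does not list JoyAI-Image's actual CLI options:

```python
import argparse

# Hypothetical CLI mirroring the parameters described in the post;
# flag names are illustrative assumptions, not the real interface.
parser = argparse.ArgumentParser(prog="joyai-image")
parser.add_argument("--task", choices=["understand", "generate", "edit"], required=True)
parser.add_argument("--prompt", required=True, help="instruction or question")
parser.add_argument("--image", action="append", default=[],
                    help="input image path; repeat for multi-image understanding")
parser.add_argument("--size", default="1024x1024", help="output size for generation/editing")
parser.add_argument("--seed", type=int, default=42, help="random seed for reproducibility")
parser.add_argument("--rewrite-prompt", action="store_true",
                    help="let the LLM rewrite the prompt before generation")

args = parser.parse_args(
    ["--task", "generate", "--prompt", "a red lantern", "--size", "768x1024", "--rewrite-prompt"]
)
print(args.task, args.size, args.seed)  # generate 768x1024 42
```

The repeatable `--image` flag matches the multi-image understanding mode, and `--rewrite-prompt` would route the prompt through the LLM-based rewriter before generation.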

Section 07

Open-Source Ecosystem and Future Outlook

JoyAI-Image is released under the Apache-2.0 license, with weights available on Hugging Face. The JD team is hiring researchers and engineers to focus on the research, development, and deployment of next-generation generative models. The project provides academia and industry with a multimodal tool, particularly in emerging areas such as spatial understanding and long-text rendering, and the team looks forward to community participation to drive its development.