Zing Forum

JoyAI-Image: JD Open-Source Unified Multimodal Foundation Model Enabling Closed-Loop Collaboration of Image Understanding, Generation, and Editing

JoyAI-Image is a 24B-parameter unified multimodal foundation model open-sourced by JD. It achieves deep integration of three core capabilities (image understanding, text-to-image generation, and instruction-guided image editing) through a collaborative architecture that combines an 8B multimodal large language model (MLLM) with a 16B multimodal diffusion Transformer (MMDiT).

Tags: multimodal model, image generation, image editing, diffusion model, spatial understanding, long text rendering, JD open source, Apache-2.0
Published 2026-04-02 23:43 · Recent activity 2026-04-02 23:50 · Estimated read: 6 min

Section 01

[Introduction] JD Open-Sources JoyAI-Image: Unified Multimodal Model Enables Closed-Loop Collaboration of Image Understanding, Generation, and Editing

JoyAI-Image, open-sourced by JD, is a 24B-parameter unified multimodal foundation model. It deeply integrates three core capabilities (image understanding, text-to-image generation, and instruction-guided image editing) via a collaborative architecture combining an 8B multimodal large language model (MLLM) and a 16B multimodal diffusion Transformer (MMDiT), forming an "Understand-Generate-Edit" closed loop. The model offers strong spatial understanding, long-text rendering, and controllable spatial editing, and is released under the Apache-2.0 license.


Section 02

Project Background and Core Design Philosophy

JoyAI-Image is a comprehensive AI system open-sourced by JD. Its core design philosophy is the "Understand-Generate-Edit" closed-loop collaboration: stronger spatial understanding enhances scene generation and controllable editing effects, while generative transformations (e.g., perspective changes) provide supplementary evidence for spatial reasoning. The model combines the 8B MLLM and 16B MMDiT to achieve unified processing of the three tasks.


Section 03

Technical Architecture and Core Innovations

JoyAI-Image adopts an MLLM-MMDiT shared-interface design to enable collaboration and knowledge sharing across the understanding, generation, and editing tasks. For spatial intelligence, it couples understanding and generation through a bidirectional loop mechanism, giving it stronger spatial understanding, controllable spatial editing, and perspective-assisted reasoning: it can comprehend spatial relationships in images and generate or edit logically consistent content.
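The post does not detail how the shared interface passes information between the two models. A common design in MLLM-conditioned diffusion Transformers is for the MLLM's token embeddings to condition the denoiser through cross-attention, with image latents as queries and MLLM tokens as keys/values. A minimal NumPy sketch of that pattern (all shapes, names, and the single-head simplification are illustrative assumptions, not JoyAI-Image's actual interface):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(image_latents, mllm_tokens):
    """Single-head cross-attention: image latents (queries) attend to
    MLLM condition tokens (keys/values). Projections omitted for brevity."""
    d_k = mllm_tokens.shape[-1]
    scores = image_latents @ mllm_tokens.T / np.sqrt(d_k)  # (patches, tokens)
    weights = softmax(scores, axis=-1)                     # rows sum to 1
    return weights @ mllm_tokens                           # (patches, dim)

rng = np.random.default_rng(0)
latents = rng.normal(size=(16, 64))  # 16 image patches, model dim 64
cond = rng.normal(size=(8, 64))      # 8 MLLM condition tokens
out = cross_attend(latents, cond)
print(out.shape)  # (16, 64)
```

In a full MMDiT block this would be one sublayer among self-attention and MLP sublayers, repeated per denoising step; the sketch only shows where the MLLM's output enters the diffusion model.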


Section 04

Demonstration of Long Text Rendering and Spatial Editing Capabilities

Long Text Rendering: Optimized to handle complex text scenarios (multi-panel comics, multi-line text, multilingual typesetting, etc.), maintaining layout fidelity and typesetting effects. Suitable for e-commerce product images, graphic creation, and other scenarios.

Spatial Editing: Supports three modes: object movement (automatic occlusion/lighting handling), object rotation (multi-view standard rotation), and camera control (adjusting yaw/pitch/zoom), ensuring scene consistency and instruction compliance.
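The three editing modes above each take different controls. One way to picture the parameter surface is as small schemas, one per mode; the field names below are hypothetical illustrations of the controls the post describes (offsets, rotation angle, yaw/pitch/zoom), not JoyAI-Image's actual API:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class ObjectMove:
    target: str                  # natural-language reference to the object
    offset_px: Tuple[int, int]   # (dx, dy) translation; occlusion/lighting handled by the model

@dataclass
class ObjectRotate:
    target: str
    azimuth_deg: float           # multi-view standard rotation angle

@dataclass
class CameraControl:
    yaw_deg: float = 0.0
    pitch_deg: float = 0.0
    zoom: float = 1.0            # >1 zooms in, <1 zooms out

edit = CameraControl(yaw_deg=30.0, zoom=1.2)
print(edit)
```

Keeping the modes as separate structured requests (rather than one free-form string) is one way a front end could guarantee the "instruction compliance" the post claims, since each control maps to a well-defined transformation.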


Section 05

Training Data and Optimization Strategy

The model uses an extensible data pipeline covering spatial understanding data (OpenSpatial), long text rendering data, editing data, etc. It is paired with a multi-stage optimization strategy to ensure balanced performance across all tasks. Spatial data enhances spatial relationship understanding, long text data strengthens text scene processing, and editing data facilitates precise instruction-based modifications.
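The post names the data sources and a multi-stage strategy but not the actual recipe. As a sketch of what "balanced performance across all tasks" might look like operationally, here is a hypothetical stage schedule that shifts the mixing ratios across stages; the stage names and ratios are assumptions for illustration, not JoyAI-Image's published configuration:

```python
# Hypothetical multi-stage data-mixing schedule. Each stage reweights the
# data sources named in the post (OpenSpatial spatial data, long-text
# rendering data, editing data) so no single task dominates training.
stages = [
    {"stage": "pretrain",  "mix": {"text_to_image": 0.7, "spatial_openspatial": 0.2, "long_text": 0.1}},
    {"stage": "mid_train", "mix": {"text_to_image": 0.4, "spatial_openspatial": 0.3, "long_text": 0.2, "editing": 0.1}},
    {"stage": "finetune",  "mix": {"editing": 0.5, "long_text": 0.3, "spatial_openspatial": 0.2}},
]

for s in stages:
    # Sanity check: every stage's sampling ratios form a valid distribution.
    assert abs(sum(s["mix"].values()) - 1.0) < 1e-9
print([s["stage"] for s in stages])
```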


Section 06

Practical Applications and Inference Support

Complete inference code and parameter descriptions are provided, supporting three main tasks:

  • Image understanding: multi-image input, enabling image comparison and description;
  • Generation/editing: controlled by natural-language instructions, with parameters such as output size and random seed;
  • Prompt rewriting: uses an LLM to optimize input prompts and improve generation quality.

All three tasks can be invoked via the command-line interface.
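The command-line surface described above can be sketched as an argument parser covering the three tasks and the parameters the post mentions (multi-image input, output size, random seed, prompt rewriting). All flag names here are hypothetical, since the post does not list JoyAI-Image's actual CLI options:

```python
import argparse

# Hypothetical CLI mirroring the parameters described in the post;
# flag names are illustrative assumptions, not the real interface.
parser = argparse.ArgumentParser(prog="joyai-image")
parser.add_argument("--task", choices=["understand", "generate", "edit"], required=True)
parser.add_argument("--prompt", required=True, help="instruction or question")
parser.add_argument("--image", action="append", default=[],
                    help="input image path; repeat for multi-image understanding")
parser.add_argument("--size", default="1024x1024", help="output size for generation/editing")
parser.add_argument("--seed", type=int, default=42, help="random seed for reproducibility")
parser.add_argument("--rewrite-prompt", action="store_true",
                    help="let the LLM rewrite the prompt before generation")

args = parser.parse_args(
    ["--task", "generate", "--prompt", "a red lantern", "--size", "768x1024", "--rewrite-prompt"]
)
print(args.task, args.size, args.seed)  # generate 768x1024 42
```

The repeatable `--image` flag matches the multi-image understanding mode, and `--rewrite-prompt` would route the prompt through the LLM-based rewriter before generation.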

Section 07

Open-Source Ecosystem and Future Outlook

JoyAI-Image is released under the Apache-2.0 license, with weights available on Hugging Face. The JD team is hiring researchers and engineers to focus on the research, development, and deployment of next-generation generative models. The project provides academia and industry with a multimodal tool, particularly in emerging areas such as spatial understanding and long-text rendering, and the team looks forward to community participation to drive its development.