Zing Forum

Reading

RecA: Unleashing the Zero-Shot Potential of Unified Multimodal Models via Reconstruction Alignment

An open-source project for ICLR 2026, proposing a self-supervised reconstruction alignment method. With only 1.5B parameters, it outperforms models of 7B-24B scale and achieves SOTA performance in image generation and editing tasks.

multimodal model · self-supervised learning · image generation · image editing · reconstruction alignment · ICLR 2026 · BAGEL · Harmon · Show-o · OpenUni
Published 2026-05-15 03:06 · Recent activity 2026-05-15 03:18 · Estimated read 8 min

Section 01

[Introduction] RecA: Zero-Shot Breakthrough of Small-Parameter Unified Multimodal Models

RecA is an open-source project for ICLR 2026, proposing a self-supervised reconstruction alignment method. With only 1.5B parameters, it outperforms models of 7B-24B scale and achieves SOTA performance in image generation and editing tasks. This thread will introduce the project's background, core methods, performance breakthroughs, application ecosystem, and future outlook in separate floors.

Section 02

Background: Development Bottlenecks of Unified Multimodal Models

In recent years, Unified Multimodal Models (UMMs) have become a hot topic in AI research, with representative works including Show-o, OpenUni, Harmon, and BAGEL. However, such models face a core challenge: achieving zero-shot generalization across diverse tasks while maintaining generation quality. Traditional multimodal models rely on large amounts of labeled data or reinforcement learning, which raises training costs and limits adaptability to new tasks. Exploring efficient self-supervised methods is therefore key.

Section 03

RecA Core: Self-Supervised Method of Reconstruction Alignment

The core concept of RecA (Reconstruction Alignment) is to achieve deep alignment of multimodal representations through input reconstruction under a self-supervised framework. Its distinguishing feature is that it relies on neither GPT-4o distillation data nor reinforcement learning: self-supervised training alone lets it outperform larger-scale models, which is particularly advantageous when computing resources are limited.
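To make the idea concrete, here is a tiny NumPy toy of a reconstruction-alignment objective: a frozen "understanding" encoder embeds an image into semantic tokens, and a trainable generation head is optimized to regenerate the image from those tokens alone, with no captions or labels. The linear model, dimensions, and names are illustrative assumptions for this sketch, not RecA's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a frozen "understanding" encoder E and a trainable
# "generation" head G. All dimensions are illustrative only.
D_IMG, D_SEM = 64, 16
E = rng.normal(size=(D_SEM, D_IMG)) / np.sqrt(D_IMG)   # frozen encoder
G = rng.normal(size=(D_IMG, D_SEM)) * 0.01             # trainable decoder

def recon_loss(x, G):
    """Self-supervised reconstruction objective: regenerate the input
    from its own semantic embedding, so no labeled pairs are needed."""
    z = E @ x          # semantic tokens serve as a dense visual "prompt"
    x_hat = G @ z      # generation head reconstructs the image
    return np.mean((x_hat - x) ** 2), z, x_hat

x = rng.normal(size=(D_IMG, 32))   # a batch of 32 toy "images"
lr = 0.5
losses = []
for step in range(200):
    loss, z, x_hat = recon_loss(x, G)
    losses.append(loss)
    grad = 2 * (x_hat - x) @ z.T / x.size   # dL/dG for the MSE objective
    G -= lr * grad

print(f"reconstruction loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The point of the sketch is only that the supervision signal comes entirely from the input itself; in the actual method the frozen encoder and the generation branch are full networks rather than linear maps.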

Section 04

Technical Implementation: Cross-Architecture Validation and Resource Support

RecA has been validated on multiple mainstream unified multimodal architectures: Show-o (image generation model based on CLIP and VQGAN), OpenUni (unified multimodal understanding series), Harmon (high-resolution image generation model), and BAGEL (multimodal model developed by ByteDance's Seed team). The project provides complete training and evaluation code, detailed guides, and RecA-optimized model weights (supporting precisions like BF16, NF4, INT8, DF11) to facilitate deployment on different hardware.
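As background on why the quantized weight releases matter, here is a minimal NumPy sketch of per-tensor absmax INT8 quantization, one of the simpler schemes in this family (NF4 and DF11 are more elaborate). The matrix size and scheme are illustrative assumptions, not RecA's packaging format.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(1024, 1024)).astype(np.float32)  # a toy weight matrix

# Per-tensor absmax INT8 quantization: scale so the largest-magnitude
# weight maps to 127, round to 8-bit integers, dequantize on the fly.
scale = np.abs(W).max() / 127.0
W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
W_dq = W_q.astype(np.float32) * scale   # approximate reconstruction

mem_fp32 = W.nbytes   # 4 bytes per weight
mem_int8 = W_q.nbytes # 1 byte per weight (plus a single fp32 scale)
err = np.abs(W - W_dq).max()

print(f"memory: {mem_fp32} -> {mem_int8} bytes, max abs error {err:.4f}")
```

The 4x memory reduction at a bounded per-weight error is what makes running larger RecA checkpoints feasible on consumer GPUs.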

Section 05

Performance Breakthrough: The Counterattack of Small-Parameter Models

Image Generation Tasks

RecA-tuned models perform excellently on GenEval and DPGBench benchmarks:

| Model | Parameter Count | GenEval | DPGBench |
| --- | --- | --- | --- |
| Harmon-1.5B-RecA | 1.5B | 85.7 (+12.8) | 87.21 (+6.28) |
| OpenUni-2-1.6B-RecA | 3.6B | 74.1 (+12.2) | 82.75 (+3.73) |
| BAGEL-RecA | 14B | 82.4 (+3.6) | 85.29 (+1.26) |

Harmon-1.5B-RecA, with only 1.5B parameters, outperforms many models of 7B-24B scale. After combining with GPT-4o-Image distillation, Harmon-1.5B-RecA-plus achieves 90.0 on GenEval and 88.15 on DPGBench.

Image Editing Capability

On ImgEdit and GEdit benchmarks, BAGEL-RecA improves by 0.37 and 0.33 points respectively compared to the base model, and its editing quality is comparable to SOTA methods like ICEdit, FLUX-Kontext, and GPT-4o.

Section 06

Practical Applications: Ecosystem Integration and Convenient Deployment

The project provides multiple usage methods:

  • Hugging Face Online Demo: Experience BAGEL-RecA's image generation/editing capabilities directly in the browser without local configuration;
  • ComfyUI Support: Integrated with the ComfyUI-BAGEL project, supporting NF4/INT8 quantization to reduce memory requirements;
  • Local Deployment Guide: Detailed installation and inference guides, as well as Jupyter Notebook examples, to facilitate developers' onboarding.

Section 07

Research Significance and Future Outlook

Research Significance

  1. Self-Supervised Potential: Well-designed self-supervised objectives can unleash the inherent capabilities of models without expensive labeling or complex post-training;
  2. Parameter Efficiency: Small-parameter models can match large models through better alignment mechanisms, which is important for resource-constrained scenarios;
  3. Cross-Architecture Generality: RecA has been validated across multiple architectures, suggesting that reconstruction alignment is a general representation-learning method rather than an architecture-specific trick.

Future Outlook

The team plans to expand the training scale of BAGEL, support new architectures like Janus-Pro/Show-o2, and continuously optimize performance. The code and weights are fully open-source, and it is expected to become a baseline for UMM research. Chinese and English reproduction guides are provided to help developers reproduce the results.