# RecA: Unleashing the Zero-Shot Potential of Unified Multimodal Models via Reconstruction Alignment

> An open-source project for ICLR 2026, proposing a self-supervised reconstruction alignment method. With only 1.5B parameters, it outperforms models of 7B-24B scale and achieves SOTA performance in image generation and editing tasks.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Posted: 2026-05-14T19:06:28.000Z
- Last activity: 2026-05-14T19:18:02.737Z
- Popularity: 154.8
- Keywords: multimodal model, self-supervised learning, image generation, image editing, reconstruction alignment, ICLR 2026, BAGEL, Harmon, Show-o, OpenUni
- Page link: https://www.zingnex.cn/en/forum/thread/reca
- Canonical: https://www.zingnex.cn/forum/thread/reca
- Markdown source: floors_fallback

---

## [Introduction] RecA: Zero-Shot Breakthrough of Small-Parameter Unified Multimodal Models

RecA is an open-source project for ICLR 2026, proposing a self-supervised reconstruction alignment method. With only 1.5B parameters, it outperforms models of 7B-24B scale and achieves SOTA performance in image generation and editing tasks. This thread will introduce the project's background, core methods, performance breakthroughs, application ecosystem, and future outlook in separate floors.

## Background: Bottlenecks of Unified Multimodal Models

In recent years, Unified Multimodal Models (UMMs) have become a major focus of AI research, with representative works including Show-o, OpenUni, Harmon, and BAGEL. However, such models face a core challenge: achieving zero-shot generalization across diverse tasks while maintaining generation quality. Traditional multimodal models rely on large amounts of labeled data or reinforcement learning, which raises training costs and limits adaptability to new tasks. Exploring efficient self-supervised methods is therefore key.

## Core Idea of RecA: Self-Supervised Reconstruction Alignment

The core idea of RecA (Reconstruction Alignment) is to achieve deep alignment of multimodal representations by reconstructing the input under a self-supervised framework. Its distinguishing feature is that it relies on neither GPT-4o distillation data nor reinforcement learning: self-supervised training alone is enough to outperform larger-scale models, which is particularly advantageous when computing resources are limited.
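The thread does not reproduce the paper's exact objective, but the basic shape of reconstruction alignment can be sketched: encode an image with the model's understanding branch, condition the generation branch on that embedding, and minimize a reconstruction loss against the original image. Below is a minimal NumPy sketch with toy stand-ins (the linear "encoder" and "decoder" and the MSE loss are illustrative assumptions, not RecA's actual modules or objective):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a linear "understanding encoder" and a linear
# "generation decoder"; in RecA these are the UMM's real branches.
D_IMG, D_EMB = 16, 4
W_enc = rng.normal(scale=0.1, size=(D_IMG, D_EMB))
W_dec = rng.normal(scale=0.1, size=(D_EMB, D_IMG))

def reconstruction_loss(x, W_enc, W_dec):
    """Self-supervised objective: regenerate the input from its own
    understanding embedding and penalize the reconstruction error."""
    z = x @ W_enc       # understanding embedding (a "dense caption")
    x_hat = z @ W_dec   # image regenerated by the generation branch
    return np.mean((x_hat - x) ** 2)

x = rng.normal(size=(8, D_IMG))  # a batch of flattened "images"

# One analytic gradient-descent step on the decoder, to show the
# reconstruction objective provides a usable training signal.
loss_before = reconstruction_loss(x, W_enc, W_dec)
z = x @ W_enc
err = z @ W_dec - x
grad = 2.0 * z.T @ err / err.size   # d(mean sq. error)/d(W_dec)
W_dec = W_dec - 0.1 * grad
loss_after = reconstruction_loss(x, W_enc, W_dec)
```

No labels or external captions are needed at any point: the image supervises its own reconstruction, which is what makes the approach self-supervised.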

## Technical Implementation: Cross-Architecture Validation and Resource Support

RecA has been validated on multiple mainstream unified multimodal architectures: Show-o (image generation model based on CLIP and VQGAN), OpenUni (unified multimodal understanding series), Harmon (high-resolution image generation model), and BAGEL (multimodal model developed by ByteDance's Seed team). The project provides complete training and evaluation code, detailed guides, and RecA-optimized model weights (supporting precisions like BF16, NF4, INT8, DF11) to facilitate deployment on different hardware.
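The project's actual quantized loaders are not shown in this thread, but the memory trade-off behind an option like INT8 can be illustrated with a simple symmetric per-tensor quantize/dequantize round trip (an illustrative sketch, not the project's code; real deployments typically use per-channel scales and library kernels):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: map [-max|w|, max|w|]
    onto the integer range [-127, 127] with a single scale factor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Recover an approximate FP32 tensor from the INT8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)

# INT8 storage is 4x smaller than FP32 for the same tensor,
# at the cost of a small, bounded rounding error per weight.
bytes_fp32, bytes_int8 = w.nbytes, q.nbytes
max_err = np.abs(w - w_hat).max()
```

The same size/accuracy trade-off motivates the even smaller NF4 format, which packs two 4-bit codes per byte.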

## Performance Breakthrough: The Counterattack of Small-Parameter Models

### Image Generation Tasks

RecA-tuned models perform excellently on GenEval and DPGBench benchmarks:

| Model | Parameter Count | GenEval | DPGBench |
|------|----------------|---------|----------|
| Harmon-1.5B-RecA | 1.5B | 85.7 (+12.8) | 87.21 (+6.28) |
| OpenUni-2-1.6B-RecA | 3.6B | 74.1 (+12.2) | 82.75 (+3.73) |
| BAGEL-RecA | 14B | 82.4 (+3.6) | 85.29 (+1.26) |

Harmon-1.5B-RecA, with only 1.5B parameters, outperforms many models of 7B-24B scale. After combining with GPT-4o-Image distillation, Harmon-1.5B-RecA-plus achieves 90.0 on GenEval and 88.15 on DPGBench.

### Image Editing Capability

On ImgEdit and GEdit benchmarks, BAGEL-RecA improves by 0.37 and 0.33 points respectively compared to the base model, and its editing quality is comparable to SOTA methods like ICEdit, FLUX-Kontext, and GPT-4o.

## Practical Applications: Ecosystem Integration and Deployment

The project provides multiple usage methods:
- **Hugging Face Online Demo**: Experience BAGEL-RecA's image generation/editing capabilities directly in the browser without local configuration;
- **ComfyUI Support**: Integrated with the ComfyUI-BAGEL project, supporting NF4/INT8 quantization to reduce memory requirements;
- **Local Deployment Guide**: Detailed installation and inference guides, as well as Jupyter Notebook examples, to facilitate developers' onboarding.

## Research Significance and Future Outlook

### Research Significance
1. **Self-Supervised Potential**: Well-designed self-supervised objectives can unleash the inherent capabilities of models without expensive labeling or complex post-training;
2. **Parameter Efficiency**: Small-parameter models can match large models through better alignment mechanisms, which is important for resource-constrained scenarios;
3. **Cross-Architecture Generality**: RecA has been validated across multiple architectures, and reconstruction alignment is a general representation learning method.

### Future Outlook
The team plans to expand the training scale of BAGEL, support new architectures like Janus-Pro/Show-o2, and continuously optimize performance. The code and weights are fully open-source, and it is expected to become a baseline for UMM research. Chinese and English reproduction guides are provided to help developers reproduce the results.
