
MuCo: NAVER AI Lab's Multi-turn Contrastive Learning Multimodal Embedding Model

A CVPR 2026-accepted work that trains a multimodal embedding model via multi-turn conversational contrastive learning, achieving state-of-the-art (SOTA) results on the MMEB benchmark: 70.1 points with the 2B model and 74.2 with the 7B model.

Tags: Multimodal Contrastive Learning Embedding Model · NAVER · CVPR · CLIP · Cross-modal Retrieval · PyTorch
Published 2026-04-09 16:34 · Recent activity 2026-04-09 16:47 · Estimated read: 6 min

Section 01

MuCo: Introduction to NAVER AI Lab's Multi-turn Contrastive Learning Multimodal Embedding Model

The MuCo (Multi-turn Contrastive Learning) multimodal embedding model proposed by NAVER AI Lab has been accepted by CVPR 2026. Trained via multi-turn conversational contrastive learning, it achieves SOTA performance on the MMEB benchmark (70.1 points for the 2B model and 74.2 points for the 7B model). The related pre-trained models, M3T dataset, and paper have been open-sourced, and the complete training code will be released soon. This model provides a new paradigm for multimodal embedding training.


Section 02

Evolution of Multimodal Embedding Models and Limitations of Traditional Contrastive Learning

Multimodal embedding models are bridges connecting modalities such as text and images, and are crucial in scenarios such as cross-modal retrieval. Traditional contrastive learning trains by pulling matched pairs closer and pushing unmatched pairs apart, but it suffers from uneven negative-sample quality, a lack of progressive (easy-to-hard) learning, and neglect of fine-grained relationships, so it struggles to capture subtle semantic differences.
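For reference, the "pull matched pairs closer, push unmatched pairs apart" objective is commonly implemented as a symmetric InfoNCE loss over a batch. The sketch below is a generic baseline, not MuCo's code; the function name and defaults are our own. Note how every non-diagonal item in the batch serves as a negative regardless of difficulty, which is exactly the weakness the article attributes to traditional contrastive learning.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image/text embeddings.

    Row i of img_emb pairs with row i of txt_emb; all other rows act as
    (random, possibly trivially easy) negatives.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(img.size(0))         # matched pair is the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```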


Section 03

Core of MuCo: Design of Multi-turn Conversational Contrastive Learning

MuCo recasts contrastive learning as a multi-turn conversational process: the first round separates obviously mismatched samples; each subsequent round constructs hard negatives based on the previous round's output; and the contrast difficulty is adjusted dynamically. This mimics the human easy-to-difficult learning curve and addresses the limitations of traditional methods.
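MuCo's published algorithm is not reproduced in this summary, so the following is only an illustrative sketch of what round-based hard-negative construction could look like: each round keeps the candidates most similar to the anchor, so the surviving negatives get harder round by round. All names and the pool-shrinking rule are assumptions.

```python
import torch
import torch.nn.functional as F

def multi_turn_negatives(anchor, candidates, num_rounds=3, k=8):
    """Hypothetical round-based hard-negative selection (NOT MuCo's code).

    Round 1 scores all candidates against the anchor; each later round
    keeps only the hardest (most similar) survivors, so contrast
    difficulty increases with the round index, as the article describes.
    """
    anchor = F.normalize(anchor, dim=-1)
    pool = F.normalize(candidates, dim=-1)
    idx = torch.arange(pool.size(0))            # indices into `candidates`
    for _ in range(num_rounds):
        if pool.size(0) <= k:
            break
        sims = pool @ anchor                    # cosine similarity to anchor
        hardest = sims.topk(k).indices          # keep the k hardest negatives
        pool, idx = pool[hardest], idx[hardest]
        k = max(1, k // 2)                      # shrink the pool each round
    return idx
```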


Section 04

MuCo's Technical Architecture and Training Support

Model Architecture: Built on a vision-language pre-training backbone, released in 2B (70.1 points) and 7B (74.2 points) versions, available on HuggingFace as naver-ai/MuCo-2B and naver-ai/MuCo-7B.

Dataset: Relies on NAVER's M3T multi-turn annotated dataset (naver-ai/M3T), which features progressive difficulty and large scale.

Training Strategy: Comprises a multi-turn sampler, a difficulty scheduler, and temperature-coefficient annealing.
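The article names temperature-coefficient annealing as a training component but gives no formula; a cosine schedule is one common way such annealing is implemented. The schedule shape, start/end values, and function name below are assumptions, not taken from the paper: the temperature starts soft and cools over training, sharpening the contrastive softmax.

```python
import math

def annealed_temperature(step, total_steps, t_start=0.2, t_end=0.05):
    """Hypothetical cosine annealing of the contrastive temperature.

    Returns t_start at step 0 and decays smoothly to t_end at
    total_steps; a lower temperature makes the softmax over
    similarities sharper, i.e. the contrast harder.
    """
    progress = min(step / total_steps, 1.0)
    cos = 0.5 * (1.0 + math.cos(math.pi * progress))
    return t_end + (t_start - t_end) * cos
```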


Section 05

MuCo's Experimental Results and Ablation Analysis

MMEB Benchmark Performance: MuCo-2B (70.1 points) significantly outperforms a same-scale CLIP; MuCo-7B (74.2 points) approaches the performance of larger models and scales well.

Ablation Experiments: Increasing the number of rounds improves fine-grained semantic capture; dynamic negative sampling beats random sampling; and although multi-turn learning raises the computational cost per round, it reduces the total number of training steps, for higher overall efficiency.


Section 06

Comparison of MuCo with Related Works and Application Scenarios

Comparison with Related Works: Compared to SimCLR (single-modal), CLIP (simple negative sampling), and ALIGN (dependent on data volume), MuCo is centered on multi-turn progressive contrast and requires multi-turn annotated data.

Application Scenarios: Suited to cross-modal retrieval (text-to-image and image-to-text) and fine-grained semantic understanding (visual question answering, image captioning, multimodal reasoning).
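The cross-modal retrieval use case above reduces, at inference time, to a nearest-neighbor search in the shared embedding space. The sketch below assumes the embeddings have already been produced by some model (such as MuCo); the model call itself is omitted, and the function name is ours.

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb, gallery_embs, top_k=5):
    """Text-to-image (or image-to-text) retrieval by cosine similarity.

    query_emb: (D,) embedding of the query in one modality.
    gallery_embs: (N, D) embeddings of candidates in the other modality.
    Returns the indices and scores of the top_k most similar candidates.
    """
    q = F.normalize(query_emb, dim=-1)
    g = F.normalize(gallery_embs, dim=-1)
    scores = g @ q                              # (N,) cosine similarities
    top = scores.topk(min(top_k, g.size(0)))
    return top.indices, top.values
```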


Section 07

MuCo's Open-Source Ecosystem and Future Directions

Open-Source Resources: The pre-trained models, the M3T dataset, and the paper (arXiv:2602.06393) have been released; the complete training code is scheduled for release between April 13 and 17.

Future Directions: Extend multi-turn learning to other representation-learning tasks; explore adaptive round counts, multi-agent contrast, and cross-task transfer.


Section 08

MuCo's Team Background and Research Summary

Team: A joint effort by researchers from NAVER AI Lab and Korea University, with Geonmo Gu as first author and core contributors including Byeongho Heo.

Summary: MuCo offers a new approach to multimodal embedding training through its multi-turn contrastive learning paradigm. Its SOTA performance demonstrates the method's effectiveness, and the open-source resources should spur community innovation, making it a work worth following in multimodal learning.