Zing Forum

ODE: Strategy Data Evolution Method for Vision-Native Multimodal Deep Search Agents

ODE addresses the issues of visual evidence reuse and static training data in multimodal search through the Image Bank Reference Protocol and a closed-loop data generator, significantly improving agent performance across 8 benchmark tests.

Tags: Multimodal search agents, Training data evolution, Visual reasoning, Tool use, Qwen3-VL, Reinforcement learning, Supervised fine-tuning
Published 2026-05-12 00:49 · Recent activity 2026-05-12 14:25 · Estimated read: 5 min

Section 01

[Introduction] ODE: Strategy Data Evolution Method for Vision-Native Multimodal Deep Search Agents

This paper proposes ODE, which tackles two problems in multimodal deep search: a vision-native framework (the Image Bank Reference Protocol) makes intermediate visual evidence reusable, and a closed-loop data generator (ODE) replaces static training data. Together they significantly improve agent performance across eight benchmarks; for example, the average score of Qwen3-VL-8B rises from 24.9% to 39.0%, surpassing Gemini-2.5 Pro (37.9%).


Section 02

Background: Core Challenges of Multimodal Deep Search

Multimodal deep search requires agents to chain tool calls, analyze images, and perform complex reasoning, but current systems face two major bottlenecks:

  1. Temporary Visual Evidence: Existing tool frameworks treat images as one-time outputs, so intermediate visual evidence cannot be reused by subsequent tools;
  2. Static Training Data: Data built through a fixed pipeline cannot adapt as the policy's capabilities evolve, leading to resource waste.

Section 03

Method 1: Vision-Native Agent Framework (Image Bank Mechanism)

The paper proposes a vision-native agent framework, with the core being the Image Bank Reference Protocol:

  • Register images returned by tools as addressable references and store them in an "Image Bank";
  • Subsequent tools can access historical images via reference IDs, enabling reuse of visual evidence in the reasoning chain;
  • This enables multi-step visual reasoning (e.g., map annotation → analysis), avoids repeated image generation and transmission, and improves both efficiency and information integrity.
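A minimal sketch of how such an Image Bank might look. The class and method names here are illustrative assumptions, not the paper's actual API:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an Image Bank: images returned by tools are
# registered as addressable references, so later tool calls can reuse
# them by ID instead of regenerating or re-transmitting the pixels.
@dataclass
class ImageBank:
    _store: dict = field(default_factory=dict)
    _next_id: int = 0

    def register(self, image_bytes: bytes, source_tool: str) -> str:
        """Store a tool's image output and return its reference ID."""
        ref = f"img_{self._next_id}"
        self._next_id += 1
        self._store[ref] = {"data": image_bytes, "source": source_tool}
        return ref

    def fetch(self, ref: str) -> bytes:
        """Resolve a reference ID back to the stored image."""
        return self._store[ref]["data"]

# A map tool registers its annotated output; an analysis tool reuses it.
bank = ImageBank()
ref = bank.register(b"<png bytes>", source_tool="map_annotate")
assert bank.fetch(ref) == b"<png bytes>"  # reused, not regenerated
```

The key design choice is that tools exchange small reference IDs rather than raw images, which is what keeps the reasoning chain's visual evidence both cheap to pass around and intact.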

Section 04

Method 2: Closed-Loop Mechanism for Strategy Data Evolution (ODE)

ODE is a closed-loop data generator that evolves in sync with policy training. The core loop:

  1. The current policy rolls out, producing execution trajectories;
  2. Trajectories are analyzed to identify success and failure patterns;
  3. Targeted training data is generated to strengthen the weak links;
  4. The policy is trained on the new data, and the loop repeats.

ODE supports data curation for both Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), covering an agent's full training lifecycle.
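The four steps above can be sketched as a toy loop. The stand-in "policy" (a simple membership test) and all function names are illustrative assumptions, not the paper's implementation:

```python
# Toy sketch of the ODE closed loop: rollout -> analyze -> generate
# targeted data -> retrain, repeated round by round.

def rollout(policy, tasks):
    """Step 1: run the current policy, recording (task, success) trajectories."""
    return [(task, policy(task)) for task in tasks]

def analyze(trajectories):
    """Step 2: identify failure patterns -- here, simply the failed tasks."""
    return [task for task, ok in trajectories if not ok]

def generate_data(failures):
    """Step 3: synthesize targeted training examples for the weak links."""
    return [{"task": task, "label": True} for task in failures]

def train(policy_state, data):
    """Step 4: stand-in update -- mark the targeted tasks as learned."""
    policy_state.update(example["task"] for example in data)
    return policy_state

learned = {"easy"}                      # what the policy can solve so far
policy = lambda task: task in learned
for _ in range(3):                      # round-by-round refinement
    failures = analyze(rollout(policy, ["easy", "hard", "multi-hop"]))
    learned = train(learned, generate_data(failures))

assert policy("hard") and policy("multi-hop")  # weak links strengthened
```

The point of the sketch is the feedback direction: data generation consumes the policy's own failures, so the curriculum tracks the policy rather than a fixed synthesis recipe.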

Section 05

Experimental Evidence: Significant Performance Gains

Across eight multimodal deep search benchmarks, ODE delivers significant gains:

  • Qwen3-VL-8B: Average score increased from 24.9% to 39.0% (+56%), surpassing Gemini-2.5 Pro (37.9%);
  • Qwen3-VL-30B: Average score increased from 30.6% to 41.5%.

These results indicate that dynamic data generation can help smaller models close the performance gap with larger ones.
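The relative gains follow directly from the absolute averages; a quick check:

```python
# Relative improvement = (new - old) / old, from the reported averages.
gain_8b = (39.0 - 24.9) / 24.9    # Qwen3-VL-8B
gain_30b = (41.5 - 30.6) / 30.6   # Qwen3-VL-30B
assert abs(gain_8b - 0.566) < 0.001   # ~ +56%, as reported
assert abs(gain_30b - 0.356) < 0.001  # ~ +36%
```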

Section 06

Key Findings: Value of Image Bank and Dynamic Data

  1. Image Bank Reuse: In iterative visual-refinement tasks, the bank avoids repeated generation overhead and preserves information integrity;
  2. Rollout Feedback Advantage: Data generated from the policy's actual behavior aligns with task requirements better than static synthesis;
  3. Dynamic Adaptation: Static data cannot adjust difficulty as the policy improves, while ODE's round-by-round refinement avoids wasted resources.

Section 07

Application Prospects and Future Directions

Application Prospects:

  • Extend to general agent training and Vision-Language Model (VLM) optimization;
  • Reduce multimodal data annotation costs;
  • Support continuous learning (iterative optimization after deployment).

Limitations and Future Directions:

  • Optimize computational overhead;
  • Verify open-world generalization ability;
  • Improve interpretability;
  • Adapt to multi-agent collaboration scenarios.