# ODE: Strategy Data Evolution Method for Vision-Native Multimodal Deep Search Agents

> ODE tackles two problems in multimodal search — the inability to reuse visual evidence and static training data — through an Image Bank Reference Protocol and a closed-loop data generator, significantly improving agent performance across 8 benchmarks.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-11T16:49:36.000Z
- Last activity: 2026-05-12T06:25:37.908Z
- Popularity: 137.4
- Keywords: multimodal search, agent training, data evolution, visual reasoning, tool use, Qwen3-VL, reinforcement learning, supervised fine-tuning
- Page URL: https://www.zingnex.cn/en/forum/thread/ode
- Canonical: https://www.zingnex.cn/forum/thread/ode
- Markdown source: floors_fallback

---

## Introduction

This paper makes two complementary contributions: a vision-native agent framework (the Image Bank Reference Protocol) that makes intermediate visual evidence reusable in multimodal search, and a closed-loop data generator (ODE) that addresses static training data. Together they significantly improve agent performance across 8 benchmarks; for example, the average score of Qwen3-VL-8B rises from 24.9% to 39.0%, surpassing Gemini-2.5 Pro (37.9%).

## Background: Core Challenges of Multimodal Deep Search

Multimodal deep search requires agents to chain tool calls, analyze images, and perform complex reasoning, but current systems face two major bottlenecks:
1. **Ephemeral Visual Evidence**: existing tool frameworks treat images as one-off outputs, so intermediate visual evidence cannot be reused by later tools;
2. **Static Training Data**: data built by fixed pipelines cannot adapt as the policy's capabilities evolve, which wastes resources.

## Method 1: Vision-Native Agent Framework (Image Bank Mechanism)

The paper proposes a vision-native agent framework whose core is the **Image Bank Reference Protocol** (sketched in code below):
- Images returned by tools are registered as addressable references and stored in an "Image Bank";
- Subsequent tools can access these historical images via their reference IDs, so visual evidence can be reused throughout the reasoning chain;
- This supports multi-step visual reasoning (e.g., map annotation → analysis), avoids repeated image generation and transmission, and improves both efficiency and information integrity.
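
To make the protocol concrete, here is a minimal Python sketch of an image bank with addressable references. The class and method names (`ImageBank`, `register`, `resolve`) are illustrative assumptions, not the paper's actual API:

```python
import itertools

class ImageBank:
    """Registers tool-produced images under stable, addressable reference IDs.
    Hypothetical sketch: names and structure are not from the paper."""

    def __init__(self):
        self._images = {}            # ref_id -> stored image record
        self._ids = itertools.count()

    def register(self, image_bytes: bytes, source_tool: str) -> str:
        """Store a tool's output image and return its reference ID."""
        ref_id = f"img_{next(self._ids)}"
        self._images[ref_id] = {"data": image_bytes, "source": source_tool}
        return ref_id

    def resolve(self, ref_id: str) -> bytes:
        """Let a later tool call fetch a historical image by ID instead of
        regenerating or re-transmitting it."""
        return self._images[ref_id]["data"]

# A map tool registers its rendered map; an annotation tool reuses it by ID.
bank = ImageBank()
ref = bank.register(b"<png bytes>", source_tool="map_search")
map_image = bank.resolve(ref)  # reused at full fidelity, no regeneration
```

The design point is that tools pass around small reference IDs rather than raw image payloads, which is what removes the repeated generation and transmission overhead.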

## Method 2: Closed-Loop Mechanism for Strategy Data Evolution (ODE)

ODE is a closed-loop data generator that evolves in sync with policy training. Its core loop (sketched below) is:
1. Roll out the current policy to produce execution trajectories;
2. Analyze the trajectories to identify success and failure patterns;
3. Generate targeted training data that strengthens the weak links;
4. Train the policy on the new data, then repeat.
It supports data curation for both Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), covering the agent's full training lifecycle.
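
The loop can be compressed into a few lines. The sketch below uses toy stand-ins for rollout, failure analysis, data synthesis, and the training update; every name is hypothetical, and a real system would substitute actual agent rollouts and SFT/RL updates:

```python
import random

def rollout(policy, task):
    """Toy stand-in: run the policy on a task, return (task, succeeded?)."""
    return task, random.random() < policy["skill"].get(task["type"], 0.1)

def ode_round(policy, task_pool):
    # 1. Roll out the current policy to collect execution trajectories.
    trajectories = [rollout(policy, t) for t in task_pool]
    # 2. Analyze trajectories to identify failure patterns (here: weak task types).
    weak_types = {t["type"] for t, ok in trajectories if not ok}
    # 3. Generate targeted training data that exercises the weak spots.
    new_data = [{"type": w} for w in weak_types]
    # 4. Train on the new data (SFT or RL in the paper; a toy update here).
    for item in new_data:
        skill = policy["skill"].get(item["type"], 0.1)
        policy["skill"][item["type"]] = min(1.0, skill + 0.2)
    return policy

policy = {"skill": {"map_annotation": 0.1, "chart_lookup": 0.6}}
tasks = [{"type": "map_annotation"}, {"type": "chart_lookup"}] * 4
for _ in range(3):                      # round-by-round refinement
    policy = ode_round(policy, tasks)
```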

## Experimental Evidence: Verification of Significant Performance Improvement

Across 8 multimodal deep search benchmarks, ODE delivered substantial gains:
- **Qwen3-VL-8B**: average score rose from 24.9% to 39.0% (+56% relative), surpassing Gemini-2.5 Pro (37.9%);
- **Qwen3-VL-30B**: average score rose from 30.6% to 41.5%.
This shows that dynamic data generation lets smaller models close the gap to much larger ones.

## Key Findings: Value of Image Bank and Dynamic Data

1. **Image Bank Reuse**: in iterative visual-refinement tasks, it avoids redundant generation and transmission overhead while preserving information integrity;
2. **Rollout-Feedback Advantage**: data generated from the policy's actual performance aligns with task requirements better than statically synthesized data;
3. **Dynamic Adaptation**: static data cannot adjust difficulty as the policy improves, while ODE's round-by-round refinement avoids wasted compute (see the sketch below).
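
As a rough illustration of finding 3, the sketch below raises task difficulty only once the policy's measured success rate clears a threshold, so later rounds never spend compute on tasks the policy has already mastered. The threshold and step size are invented for illustration, not taken from the paper:

```python
def next_difficulty(level: int, success_rate: float, threshold: float = 0.7) -> int:
    """Advance the difficulty tier only after the current one is mastered."""
    return level + 1 if success_rate >= threshold else level

level = 1
for rate in [0.4, 0.75, 0.9]:   # success rate measured each round
    level = next_difficulty(level, rate)
print(level)  # 3: difficulty tracked the policy's improvement round by round
```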

## Application Prospects and Future Directions

**Application Prospects**:
- Extension to general agent training and Vision-Language Model (VLM) optimization;
- Lower multimodal data annotation costs;
- Support for continual learning (iterative optimization after deployment).

**Limitations and Future Directions**:
- Reduce computational overhead;
- Verify open-world generalization;
- Improve interpretability;
- Adapt to multi-agent collaboration scenarios.
