# POINTS-Seeker: Training a Multimodal Agent Search Model from Scratch

> This article introduces POINTS-Seeker-8B, a multimodal agentic search model. It combines an Agentic Seeding training phase with V-Fold history compression to improve long-range, knowledge-intensive visual reasoning, and reports state-of-the-art results on six benchmarks.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-15T16:09:37.000Z
- Last activity: 2026-04-16T01:52:47.312Z
- Popularity: 139.3
- Keywords: multimodal search, agent model, POINTS-Seeker, visual compression, long-range reasoning, knowledge retrieval, Agentic Seeding
- Page link: https://www.zingnex.cn/en/forum/thread/points-seeker
- Canonical: https://www.zingnex.cn/forum/thread/points-seeker

---

## Introduction

POINTS-Seeker-8B is a multimodal agent search model trained from scratch. Two ideas drive it: an Agentic Seeding phase that lays the foundation for agent behavior, and V-Fold history compression, which addresses the bottleneck of long-range interaction. Together they improve long-range, knowledge-intensive visual reasoning, and the model reaches state-of-the-art performance on six benchmarks.

## Limitations of Existing Multimodal Search Paradigms

Current mainstream multimodal search methods bolt search tools onto general-purpose large multimodal models (LMMs). This approach has three major issues:

1. **Capability Misalignment**: General LMMs are trained for next-token prediction, an objective that does not directly optimize tool use;
2. **Low Interaction Efficiency**: Search is not a core part of training, so the model often needs several rounds of attempts to obtain the information it wants;
3. **Difficulty in Long-Range Reasoning**: As interaction history accumulates, the model becomes worse at locating the key information in it.

To overcome these limitations, the POINTS-Seeker team designed a dedicated model from scratch rather than retrofitting tools onto a general LMM.

## Key Innovation 1: Agentic Seeding Phase

Agentic Seeding is a purpose-built pre-training phase that establishes the foundation of agent behavior. It teaches the model to:

- **Identify Knowledge Gaps**: Determine when external information is needed;
- **Formulate Search Strategies**: Decide what to search for, and how, based on the problem;
- **Integrate Retrieval Results**: Combine retrieved content with visual understanding and existing knowledge;
- **Plan Multi-Step Actions**: Design query plans that span several steps.

Unlike simple tool-use training, this phase cultivates an agentic way of working: active exploration and hypothesis verification.
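
The four behaviors above can be pictured as a small control loop. The following is a minimal Python sketch under stated assumptions, not the POINTS-Seeker training or inference code: `TOY_INDEX`, `toy_search`, `AgentState`, `find_gaps`, and `run_agent` are hypothetical names, the search tool is an in-memory dictionary, and the "knowledge gap" check is a plain lookup standing in for a decision the real model would learn.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for an external search tool: a tiny in-memory index.
TOY_INDEX = {
    "eiffel tower height": "The Eiffel Tower is about 330 m tall.",
    "eiffel tower builder": "The tower was built by Gustave Eiffel's company.",
}


def toy_search(query: str) -> str:
    """Return the stored snippet for a query, or an empty string on a miss."""
    return TOY_INDEX.get(query.lower(), "")


@dataclass
class AgentState:
    question: str
    facts: dict = field(default_factory=dict)   # topic -> what the agent knows
    trace: list = field(default_factory=list)   # ordered log of actions


def find_gaps(state: AgentState, needed_topics: list) -> list:
    """Identify knowledge gaps: needed topics the agent has no fact for yet."""
    return [t for t in needed_topics if t not in state.facts]


def run_agent(question, needed_topics, prior_facts=None, max_steps=4):
    """Search until every needed topic is covered or the step budget runs out."""
    state = AgentState(question, dict(prior_facts or {}))
    for _ in range(max_steps):
        gaps = find_gaps(state, needed_topics)
        if not gaps:                      # nothing missing: stop searching
            break
        query = gaps[0]                   # plan: address one gap per step
        state.trace.append(("search", query))
        snippet = toy_search(query)
        # Integrate the result; record a miss so the loop does not retry it.
        state.facts[query] = snippet or "not found"
    state.trace.append(("answer", len(state.facts)))
    return state


state = run_agent(
    "How tall is the Eiffel Tower and who built it?",
    needed_topics=["eiffel tower height", "eiffel tower builder"],
    prior_facts={"eiffel tower height": "about 330 m"},
)
print(state.trace)
```

Because the height is already known, the loop issues a single search for the missing topic and then answers. In this toy version the list of needed topics is supplied by the caller; in the model described here, that judgment is what the Agentic Seeding phase is meant to teach.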

## Key Innovation 2: V-Fold History Compression Technology

V-Fold addresses the long-range interaction bottleneck with three core design choices:

- **High-Fidelity Retention of Recent History**: Recent interaction rounds are kept intact as text;
- **Visual Compression of Distant History**: Early interactions are converted into image representations;
- **Adaptive Switching**: The ratio of text retention to visual compression is adjusted dynamically.

Visual compression offers high information density, supports reasoning over spatial relationships, and helps the model grasp the historical context quickly.
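
The split between verbatim recent history and folded distant history can be sketched in a few lines of Python. This is an illustration under assumptions, not the published V-Fold algorithm: `Turn`, `FoldedImage`, `count_tokens`, and `fold_history` are hypothetical names, the token count is a whitespace split, the "adaptive" part is reduced to a single `text_budget` parameter, and `FoldedImage` is only a placeholder for an image that a real system would rasterize and pass to the visual encoder.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Turn:
    role: str
    text: str


@dataclass
class FoldedImage:
    """Placeholder for an image rendered from old turns (no rasterization here)."""
    num_turns: int
    page_text: str


def count_tokens(text: str) -> int:
    """Crude whitespace token count; a real system would use the model tokenizer."""
    return len(text.split())


def fold_history(
    history: List[Turn], text_budget: int
) -> Tuple[Optional[FoldedImage], List[Turn]]:
    """Keep the newest turns as text within text_budget tokens; fold the rest."""
    recent: List[Turn] = []
    used = 0
    # Walk backwards from the newest turn; always keep at least that one.
    for turn in reversed(history):
        cost = count_tokens(turn.text)
        if recent and used + cost > text_budget:
            break
        recent.append(turn)
        used += cost
    recent.reverse()                       # restore chronological order
    older = history[: len(history) - len(recent)]
    if not older:
        return None, recent                # everything fits as text
    page = "\n".join(f"[{t.role}] {t.text}" for t in older)
    return FoldedImage(num_turns=len(older), page_text=page), recent


history = [
    Turn("user", "what is this landmark"),
    Turn("tool", "search result: a tall iron tower"),
    Turn("assistant", "it looks like a tower"),
    Turn("user", "how tall is it"),
]
folded, recent = fold_history(history, text_budget=9)
print(folded.num_turns, [t.role for t in recent])
```

With a budget of 9 tokens, the two newest turns stay as text and the two oldest are folded into one placeholder image. Raising the budget keeps more turns verbatim and lowering it folds more, which is one simple way to picture the text-to-visual ratio being adjusted.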

## POINTS-Seeker-8B Architecture and Training Process

### Architecture Components

- **Visual Encoder**: A vision Transformer that processes high-resolution images;
- **Text Encoder and Generator**: Transformer modules responsible for understanding queries, generating responses, and issuing search instructions;
- **Agent Core**: A dedicated module for decision-making, action planning, and result integration.

### Training Process

1. **Basic Pre-training**: Learn multimodal representations from large amounts of image-text data;
2. **Agentic Seeding**: Cultivate agent behavior in a synthetic environment;
3. **Supervised Fine-tuning**: Optimize performance with real task data.
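
The ordering of the three stages matters, since each one starts from the checkpoint the previous one produced. The sketch below only illustrates that hand-off; it is not the authors' training code. `Stage`, `PIPELINE`, `run_pipeline`, and `fake_train` are hypothetical names, the stage descriptions are copied from the list above, and no hyperparameters are implied.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Stage:
    name: str
    data_source: str   # taken from the stage descriptions above
    goal: str


PIPELINE = [
    Stage("basic_pretraining", "image-text data", "multimodal representations"),
    Stage("agentic_seeding", "synthetic environment", "agent behavior"),
    Stage("supervised_finetuning", "real task data", "task performance"),
]


def run_pipeline(stages: List[Stage], train_fn: Callable[[str, Stage], str]) -> List[str]:
    """Run stages in order, feeding each stage's checkpoint into the next."""
    checkpoint = "initial_weights"
    checkpoints = []
    for stage in stages:
        checkpoint = train_fn(checkpoint, stage)
        checkpoints.append(checkpoint)
    return checkpoints


def fake_train(checkpoint: str, stage: Stage) -> str:
    """Stand-in for real training: just record which stage touched the weights."""
    return f"{checkpoint}->{stage.name}"


checkpoints = run_pipeline(PIPELINE, fake_train)
print(checkpoints[-1])
```

The final checkpoint name records the full lineage, which makes the dependency explicit: supervised fine-tuning operates on a model that has already been through Agentic Seeding.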

## Experimental Results and Ablation Validation

### Benchmark Performance

The article reports that POINTS-Seeker-8B leads on six benchmarks. Highlights by task type:

- Knowledge-intensive visual question answering: Outperforms models that follow the tool-added paradigm;
- Multi-hop reasoning: V-Fold helps maintain long-range context;
- Long-range dialogue: Performance remains stable as the number of rounds increases;
- Cross-modal retrieval: Results highlight the flexibility of the architecture.

### Ablation Experiments

- Removing Agentic Seeding: Performance drops significantly on open-domain tasks;
- Removing V-Fold: Performance in long-range interaction drops sharply as the history grows;
- V-Fold versus text truncation: V-Fold performs better because it retains more structured information.

## Application Prospects, Limitations, and Future Directions

### Application Scenarios

- Intelligent research assistant: Browse literature and charts, then synthesize the information;
- Multimodal customer service: Process images and documents, and answer questions using knowledge bases;
- Educational tutoring: Retrieve and explain knowledge points tailored to the learner;
- Medical image analysis: Assist diagnosis by combining images with the literature.

### Limitations

- High computational cost: The 8-billion-parameter model is expensive to run at inference time;
- Dependence on retrieval quality: Performance is affected by the quality of the underlying search system;
- Safety and bias: The model may inherit issues from its retrieval sources.

### Future Directions

- Larger-scale models: Explore how performance scales with parameter count;
- Multimodal expansion: Support history compression for video and audio;
- Continual learning: Improve search strategies from interactions.