Zing Forum


POINTS-Seeker: Training a Multimodal Agent Search Model from Scratch

This article introduces POINTS-Seeker-8B, which achieves breakthroughs in long-range, knowledge-intensive visual reasoning through an Agentic Seeding phase and V-Fold history compression, attaining state-of-the-art performance on six benchmarks.

Tags: Multimodal Search Agent Model · POINTS-Seeker · Visual Compression · Long-Range Reasoning · Knowledge Retrieval · Agentic Seeding
Published 2026-04-16 00:09 · Recent activity 2026-04-16 09:52 · Estimated read: 7 min

Section 01

POINTS-Seeker: Training a Multimodal Agent Search Model from Scratch (Introduction)

This article introduces POINTS-Seeker-8B, a multimodal agent search model trained from scratch. By establishing the foundation of agent behavior through the Agentic Seeding phase and combining V-Fold history compression technology to solve the bottleneck of long-range interaction, it achieves breakthroughs in long-range knowledge-intensive visual reasoning and attains state-of-the-art performance in six benchmark tests.


Section 02

Limitations of Existing Multimodal Search Paradigms

Current mainstream multimodal search methods bolt search tools onto general large multimodal models (LMMs), but this paradigm has three major issues:

  1. Capability Misalignment: General LMMs are trained for token prediction, an objective that does not explicitly optimize tool use;
  2. Low Interaction Efficiency: Because search is not a core training component, the model needs multiple rounds of attempts to obtain the information it wants;
  3. Difficulty in Long-Range Reasoning: As interaction history accumulates, the model's ability to locate key information degrades.

The POINTS-Seeker team therefore chose to design a dedicated model from scratch to overcome these limitations.

Section 03

Key Innovation 1: Agentic Seeding Phase

Agentic Seeding is a specially designed pre-training phase aimed at establishing the foundation of agent behavior:

  • Identify Knowledge Gaps: Determine when external information is needed;
  • Formulate Search Strategies: Decide what to search for and how based on the problem;
  • Integrate Retrieval Results: Combine visual understanding with existing knowledge;
  • Plan Multi-Step Actions: Design complex query plans.

Unlike simple tool-use training, Agentic Seeding cultivates an agent mindset of active exploration and hypothesis verification.
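The behaviors above can be sketched as a toy decision loop. Everything here (`has_knowledge_gap`, `formulate_query`, `run_agent`) is an illustrative stand-in under assumed semantics, not the paper's actual implementation:

```python
def has_knowledge_gap(query: str, known_facts: list[str]) -> bool:
    """Toy gap check: no known fact mentions the query topic yet."""
    return not any(query in fact for fact in known_facts)

def formulate_query(question: str) -> str:
    """Toy query formulation: strip the interrogative scaffolding."""
    return question.removeprefix("What is ").rstrip("?").strip()

def run_agent(question, known_facts, search_fn, max_steps=3):
    """Search until the knowledge gap is closed or the step budget runs out."""
    query = formulate_query(question)
    history = []
    for _ in range(max_steps):
        if not has_knowledge_gap(query, known_facts):
            break                                # gap closed: stop searching
        result = search_fn(query)                # external retrieval step
        history.append((query, result))
        known_facts = known_facts + [result]     # integrate the retrieved fact
    return known_facts, history
```

The point of the sketch is the control flow: the agent searches only while a gap remains, which is the "active exploration and hypothesis verification" mindset in miniature.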

Section 04

Key Innovation 2: V-Fold History Compression Technology

V-Fold addresses the long-range interaction bottleneck through three core designs:

  • High-Fidelity Retention of Recent History: Keep recent dialogue rounds intact;
  • Visual Compression of Distant History: Convert early interactions into image representations;
  • Adaptive Switching: Dynamically adjust the ratio of verbatim text to visually compressed history.

Visual compression offers high information density, supports spatial-relationship reasoning, and helps the model quickly grasp the historical context.
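A rough sketch of the fold-vs-keep split follows. Since rendering old turns into an image is beyond a snippet, a lossy text digest stands in for the visual compression; `fold_history` and its parameters are hypothetical:

```python
def fold_history(turns: list[str], keep_recent: int = 2, digest_chars: int = 40) -> list[str]:
    """Keep the last `keep_recent` turns verbatim; collapse all older turns
    into one compact digest entry. POINTS-Seeker renders old turns as an
    image; here a truncated text digest stands in for that compression."""
    recent = turns[-keep_recent:] if keep_recent else []
    older = turns[:len(turns) - len(recent)]
    if not older:
        return list(recent)                      # nothing old enough to fold
    digest = " | ".join(t.split()[0] + "…" for t in older)[:digest_chars]
    return ["[FOLDED] " + digest] + recent
```

The key property this preserves is the one the section describes: the context length stays bounded while the most recent rounds remain verbatim.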

Section 05

POINTS-Seeker-8B Architecture and Training Process

Architecture Components

  • Visual Encoder: Advanced vision Transformer, processing high-resolution images;
  • Text Encoder and Generator: Transformer modules responsible for query understanding, response generation, and search instructions;
  • Agent Core: Dedicated module for decision-making, action planning, and result integration.
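The wiring of these three components can be illustrated with stubbed callables. `SeekerSkeleton` and every signature here are assumptions for illustration; the real modules are 8B-scale Transformers:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SeekerSkeleton:
    """Illustrative wiring of the three components; each module is a stub."""
    vision_encoder: Callable[[bytes], List[float]]   # high-res image -> features
    text_model: Callable[[str], str]                 # query understanding / generation
    agent_core: Callable[[List[float], str], str]    # decision-making and integration

    def answer(self, image: bytes, question: str) -> str:
        visual = self.vision_encoder(image)    # encode the image
        parsed = self.text_model(question)     # understand the query
        return self.agent_core(visual, parsed) # plan, act, integrate results
```

The design point the sketch captures is the separation of concerns: perception and language modules feed a dedicated agent core rather than the agent logic being implicit in next-token prediction.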

Training Process

  1. Basic Pre-training: Learn multimodal representations from large amounts of image-text data;
  2. Agentic Seeding: Cultivate agent behavior in a synthetic environment;
  3. Supervised Fine-tuning: Optimize performance with real task data.
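The three stages above can be sketched as a pipeline that threads model state through each phase in order. The stage names, data labels, and `run_pipeline` are illustrative, not the paper's training code:

```python
# Ordered stages from the process above, paired with illustrative data sources.
STAGES = [
    ("pretrain", "image-text pairs"),        # learn multimodal representations
    ("agentic_seeding", "synthetic env"),    # cultivate agent behavior
    ("sft", "real task data"),               # optimize on real tasks
]

def run_pipeline(model_state, stage_fns):
    """Run the stages in order, threading the evolving model state through."""
    completed = []
    for name, data in STAGES:
        model_state = stage_fns[name](model_state, data)
        completed.append(name)
    return model_state, completed
```

The ordering matters: Agentic Seeding sits between generic pre-training and supervised fine-tuning, so agent behavior is laid down before task-specific optimization.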

Section 06

Experimental Results and Ablation Validation

Benchmark Performance

POINTS-Seeker-8B leads on six benchmarks:

  • Knowledge-intensive visual question answering: Outperforms the tool-augmented paradigm;
  • Multi-hop reasoning: V-Fold helps maintain long-range context;
  • Long-range dialogue: Performance remains stable as the number of rounds increases;
  • Cross-modal retrieval: Highlights the flexibility of the architecture.

Ablation Experiments

  • Removing Agentic Seeding: Significant performance drop in open-domain tasks;
  • Removing V-Fold: Performance in long-range interaction drops sharply with increasing history length;
  • V-Fold outperforms text truncation: Retains more structured information.

Section 07

Application Prospects, Limitations, and Future Directions

Application Scenarios

  • Intelligent research assistant: Literature/chart browsing and information synthesis;
  • Multimodal customer service: Process images/documents and answer questions with knowledge bases;
  • Educational tutoring: Personalized knowledge point retrieval and explanation;
  • Medical image analysis: Assist diagnosis by combining images and literature.

Limitations

  • High computational cost: The 8-billion-parameter model is expensive to run at inference time;
  • Dependence on retrieval quality: Performance is affected by the quality of the underlying system;
  • Safety and bias: May inherit issues from retrieval sources.

Future Directions

  • Larger-scale models: Explore the scaling effect of parameter expansion;
  • Multimodal expansion: Support history compression for video/audio;
  • Continuous learning: Improve search strategies from interactions.