Zing Forum


POINTS-Seeker: Training a Multimodal Agent Search Model from Scratch

This article introduces POINTS-Seeker-8B, which achieves breakthroughs in long-range, knowledge-intensive visual reasoning through an Agentic Seeding phase and V-Fold history compression, attaining state-of-the-art performance on six benchmarks.

Tags: Multimodal Search Agent Model · POINTS-Seeker · Visual Compression · Long-Range Reasoning · Knowledge Retrieval · Agentic Seeding
Published 2026-04-16 00:09 · Recent activity 2026-04-16 09:52 · Estimated read: 7 min

Section 01

POINTS-Seeker: Training a Multimodal Agent Search Model from Scratch (Introduction)

This article introduces POINTS-Seeker-8B, a multimodal agent search model trained from scratch. By establishing the foundation of agent behavior through the Agentic Seeding phase and combining V-Fold history compression technology to solve the bottleneck of long-range interaction, it achieves breakthroughs in long-range knowledge-intensive visual reasoning and attains state-of-the-art performance in six benchmark tests.


Section 02

Limitations of Existing Multimodal Search Paradigms

Current mainstream multimodal search methods bolt search tools onto general large multimodal models (LMMs), but this paradigm has three major issues:

  1. Capability Misalignment: General LMMs are trained for token prediction, an objective that does not explicitly optimize tool use;
  2. Low Interaction Efficiency: Because search is not a core training component, the model needs multiple rounds of attempts to obtain the information it wants;
  3. Difficulty in Long-Range Reasoning: As interaction history accumulates, the model's ability to locate key information degrades.

The POINTS-Seeker team therefore chose to design a dedicated model from scratch to overcome these limitations.

Section 03

Key Innovation 1: Agentic Seeding Phase

Agentic Seeding is a specially designed pre-training phase aimed at establishing the foundation of agent behavior:

  • Identify Knowledge Gaps: Determine when external information is needed;
  • Formulate Search Strategies: Decide what to search for and how based on the problem;
  • Integrate Retrieval Results: Combine visual understanding with existing knowledge;
  • Plan Multi-Step Actions: Design complex query plans.

Unlike simple tool-use training, Agentic Seeding cultivates an agent mindset of active exploration and hypothesis verification.
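The behaviors above can be sketched as a toy decision loop. Everything here (`has_knowledge_gap`, `formulate_query`, `run_agent`) is an illustrative stand-in under assumed semantics, not the paper's actual implementation:

```python
def has_knowledge_gap(query: str, known_facts: list[str]) -> bool:
    """Toy gap check: no known fact mentions the query topic yet."""
    return not any(query in fact for fact in known_facts)

def formulate_query(question: str) -> str:
    """Toy query formulation: strip the interrogative scaffolding."""
    return question.removeprefix("What is ").rstrip("?").strip()

def run_agent(question, known_facts, search_fn, max_steps=3):
    """Search until the knowledge gap is closed or the step budget runs out."""
    query = formulate_query(question)
    history = []
    for _ in range(max_steps):
        if not has_knowledge_gap(query, known_facts):
            break                                # gap closed: stop searching
        result = search_fn(query)                # external retrieval step
        history.append((query, result))
        known_facts = known_facts + [result]     # integrate the retrieved fact
    return known_facts, history
```

The point of the sketch is the control flow: the agent searches only while a gap remains, which is the "active exploration and hypothesis verification" mindset in miniature.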

Section 04

Key Innovation 2: V-Fold History Compression Technology

V-Fold addresses the long-range interaction bottleneck through three core designs:

  • High-Fidelity Retention of Recent History: Keep recent dialogue rounds intact;
  • Visual Compression of Distant History: Convert early interactions into image representations;
  • Adaptive Switching: Dynamically adjust the ratio of verbatim text to visually compressed history.

Visual compression offers high information density, supports spatial-relationship reasoning, and helps the model quickly grasp the historical context.
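A rough sketch of the fold-vs-keep split follows. Since rendering old turns into an image is beyond a snippet, a lossy text digest stands in for the visual compression; `fold_history` and its parameters are hypothetical:

```python
def fold_history(turns: list[str], keep_recent: int = 2, digest_chars: int = 40) -> list[str]:
    """Keep the last `keep_recent` turns verbatim; collapse all older turns
    into one compact digest entry. POINTS-Seeker renders old turns as an
    image; here a truncated text digest stands in for that compression."""
    recent = turns[-keep_recent:] if keep_recent else []
    older = turns[:len(turns) - len(recent)]
    if not older:
        return list(recent)                      # nothing old enough to fold
    digest = " | ".join(t.split()[0] + "…" for t in older)[:digest_chars]
    return ["[FOLDED] " + digest] + recent
```

The key property this preserves is the one the section describes: the context length stays bounded while the most recent rounds remain verbatim.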

Section 05

POINTS-Seeker-8B Architecture and Training Process

Architecture Components

  • Visual Encoder: Advanced vision Transformer, processing high-resolution images;
  • Text Encoder and Generator: Transformer modules responsible for query understanding, response generation, and search instructions;
  • Agent Core: Dedicated module for decision-making, action planning, and result integration.
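The wiring of these three components can be illustrated with stubbed callables. `SeekerSkeleton` and every signature here are assumptions for illustration; the real modules are 8B-scale Transformers:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SeekerSkeleton:
    """Illustrative wiring of the three components; each module is a stub."""
    vision_encoder: Callable[[bytes], List[float]]   # high-res image -> features
    text_model: Callable[[str], str]                 # query understanding / generation
    agent_core: Callable[[List[float], str], str]    # decision-making and integration

    def answer(self, image: bytes, question: str) -> str:
        visual = self.vision_encoder(image)    # encode the image
        parsed = self.text_model(question)     # understand the query
        return self.agent_core(visual, parsed) # plan, act, integrate results
```

The design point the sketch captures is the separation of concerns: perception and language modules feed a dedicated agent core rather than the agent logic being implicit in next-token prediction.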

Training Process

  1. Basic Pre-training: Learn multimodal representations from large amounts of image-text data;
  2. Agentic Seeding: Cultivate agent behavior in a synthetic environment;
  3. Supervised Fine-tuning: Optimize performance with real task data.
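The three stages above can be sketched as a pipeline that threads model state through each phase in order. The stage names, data labels, and `run_pipeline` are illustrative, not the paper's training code:

```python
# Ordered stages from the process above, paired with illustrative data sources.
STAGES = [
    ("pretrain", "image-text pairs"),        # learn multimodal representations
    ("agentic_seeding", "synthetic env"),    # cultivate agent behavior
    ("sft", "real task data"),               # optimize on real tasks
]

def run_pipeline(model_state, stage_fns):
    """Run the stages in order, threading the evolving model state through."""
    completed = []
    for name, data in STAGES:
        model_state = stage_fns[name](model_state, data)
        completed.append(name)
    return model_state, completed
```

The ordering matters: Agentic Seeding sits between generic pre-training and supervised fine-tuning, so agent behavior is laid down before task-specific optimization.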

Section 06

Experimental Results and Ablation Validation

Benchmark Performance

POINTS-Seeker-8B leads on six benchmarks:

  • Knowledge-intensive visual question answering: Outperforms the tool-augmented paradigm;
  • Multi-hop reasoning: V-Fold helps maintain long-range context;
  • Long-range dialogue: Performance remains stable as the number of rounds increases;
  • Cross-modal retrieval: Highlights the flexibility of the architecture.

Ablation Experiments

  • Removing Agentic Seeding: Significant performance drop in open-domain tasks;
  • Removing V-Fold: Performance in long-range interaction drops sharply with increasing history length;
  • V-Fold outperforms text truncation: Retains more structured information.

Section 07

Application Prospects, Limitations, and Future Directions

Application Scenarios

  • Intelligent research assistant: Literature/chart browsing and information synthesis;
  • Multimodal customer service: Process images/documents and answer questions with knowledge bases;
  • Educational tutoring: Personalized knowledge point retrieval and explanation;
  • Medical image analysis: Assist diagnosis by combining images and literature.

Limitations

  • High computational cost: The 8-billion-parameter model is expensive to run at inference time;
  • Dependence on retrieval quality: Performance is affected by the quality of the underlying system;
  • Safety and bias: May inherit issues from retrieval sources.

Future Directions

  • Larger-scale models: Explore the scaling effect of parameter expansion;
  • Multimodal expansion: Support history compression for video/audio;
  • Continuous learning: Improve search strategies from interactions.