Zing Forum

ATLAS: A New Paradigm Unifying Agentic and Implicit Visual Reasoning with a Single Token

The ATLAS framework unifies agentic reasoning and implicit visual reasoning into a single discrete token via "functional tokens". It avoids external execution latency while retaining interpretability, and introduces LA-GRPO for stable training.

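Tags: visual reasoning, multimodal large models, functional tokens, ATLAS, GRPO, reinforcement learning, agentic AI, implicit reasoning, token prediction, explainable AI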
Published 2026-05-15 01:59 | Recent activity 2026-05-16 01:18 | Estimated read 6 min

Section 01

ATLAS Framework: A New Paradigm Unifying Agentic and Implicit Visual Reasoning with Functional Tokens

The ATLAS framework is a visual reasoning paradigm proposed by institutions including the Chinese University of Hong Kong and Shanghai Artificial Intelligence Laboratory. Its core idea is to unify agentic reasoning and implicit visual reasoning in a single discrete token, the functional token. This design removes the external execution latency of agentic reasoning while keeping the reasoning interpretable. It also introduces the LA-GRPO algorithm to address the sparsity of functional tokens during training, so that the performance gains do not come at the cost of interpretability.

Section 02

Background: The Dilemma of Visual Reasoning

Visual reasoning requires handling intermediate visual states, and each of the two existing approaches has limitations:

  • Agentic reasoning: Manipulates visual content through code or external tools. It is highly interpretable, but context switching is costly and inference is slow;
  • Implicit reasoning: Represents visual states with internal hidden embeddings. It is fast, but generalizes poorly and is hard to reconcile with parallel autoregressive training.

Section 03

Core of ATLAS: Threefold Design of Functional Tokens

Functional tokens are the core of ATLAS, and their design has three aspects:

  1. Internalized visual operations: Each functional token is associated with an internal visual operation (e.g., rotation, zooming), so no external tool is called and the associated latency disappears;
  2. Standard token attributes: Functional tokens belong to the tokenizer vocabulary and are generated by standard next-token prediction, with no change to the model architecture;
  3. No visual supervision needed: Functional tokens are learned automatically from end-to-end task objectives (e.g., question-answering correctness), without explicit visual annotations.
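
The second property can be illustrated with a toy example. This is a minimal sketch, not the ATLAS implementation: the token names `<rotate>` and `<zoom>`, the tiny vocabulary, and the logits are all hypothetical. It only shows that once functional tokens are ordinary vocabulary entries, a plain softmax over next-token logits can select them without any special decoding path.

```python
import math

# Hypothetical base vocabulary and functional tokens (names are illustrative only).
BASE_VOCAB = ["the", "triangle", "is", "left", "right", "<eos>"]
FUNCTIONAL_TOKENS = ["<rotate>", "<zoom>"]


def build_vocab(base, functional):
    """Append functional tokens to the vocabulary so they get ordinary token IDs."""
    vocab = list(base) + [t for t in functional if t not in base]
    return {tok: idx for idx, tok in enumerate(vocab)}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_next_token(logits, token_to_id):
    """Pick the most probable token; functional tokens compete like any other."""
    id_to_token = {i: t for t, i in token_to_id.items()}
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id_to_token[best], probs[best]


token_to_id = build_vocab(BASE_VOCAB, FUNCTIONAL_TOKENS)
# Made-up logits in which the model prefers the hypothetical <rotate> token.
logits = [0.1, 0.3, 0.2, 0.0, 0.0, -1.0, 2.5, 0.4]
token, prob = greedy_next_token(logits, token_to_id)
print(token, round(prob, 3))
```

In a real model the logits would come from the output head, whose size would grow by the number of added tokens; that resizing step is not shown here.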

Section 04

LA-GRPO: Key Algorithm to Solve Sparsity in Functional Token Training

Early in training, functional tokens are sparse: they make up a very small fraction of the generated tokens, so their gradient signal is weak. LA-GRPO addresses this by adding statically weighted auxiliary objectives that act as anchor loss terms for functional tokens. These terms provide stable gradients even when a batch contains few functional tokens, which keeps the sample efficiency of GRPO while addressing the training instability.
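
The summary above does not give the LA-GRPO formula, so the following is only a rough sketch of the general idea under stated assumptions. It combines GRPO-style group-relative advantages with a hypothetical anchor term that is averaged over functional tokens only and scaled by a fixed weight. The function names, the form of the anchor term, and the value `anchor_weight=0.5` are all assumptions made for illustration, and clipping and KL penalties are omitted.

```python
import math


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def policy_loss(logps, advantages):
    """Simplified policy term: advantage-weighted negative log-probability,
    averaged over ALL tokens (no clipping, no KL penalty)."""
    total, count = 0.0, 0
    for seq_logps, adv in zip(logps, advantages):
        for lp in seq_logps:
            total += -adv * lp
            count += 1
    return total / max(count, 1)


def anchor_loss(logps, functional_mask, advantages):
    """Assumed anchor term: the same quantity restricted to functional tokens
    and averaged over their count, so it is not diluted by ordinary tokens."""
    total, count = 0.0, 0
    for seq_logps, seq_mask, adv in zip(logps, functional_mask, advantages):
        for lp, is_functional in zip(seq_logps, seq_mask):
            if is_functional:
                total += -adv * lp
                count += 1
    return total / count if count else 0.0


def anchored_loss(logps, functional_mask, rewards, anchor_weight=0.5):
    """Policy term plus a statically weighted anchor term (weight is hypothetical)."""
    advantages = group_relative_advantages(rewards)
    main = policy_loss(logps, advantages)
    anchor = anchor_loss(logps, functional_mask, advantages)
    return main + anchor_weight * anchor, main, anchor


# Toy group of four sampled responses; only two contain a functional token.
rewards = [1.0, 0.0, 0.0, 1.0]
logps = [[-0.2, -1.5, -0.3], [-0.4, -0.6, -0.5], [-0.3, -0.7, -0.2], [-0.1, -2.0, -0.4]]
mask = [[False, True, False], [False] * 3, [False] * 3, [False, True, False]]
total, main, anchor = anchored_loss(logps, mask, rewards)
print(total, main, anchor)
```

The design point is the normalization: the two functional tokens account for only 2 of 12 terms in the policy average, while the anchor term averages over those 2 tokens alone, so their contribution does not shrink as ordinary tokens are added.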

Section 05

Experimental Validation: Performance of ATLAS on Multiple Tasks

ATLAS is reported to perform strongly on multiple visual reasoning benchmarks:

  • Geometric reasoning: In tasks that require precise judgment of spatial relationships, the functional tokens make the reasoning process visible;
  • Visual question answering: In complex multi-step reasoning QA tasks, ATLAS leads in accuracy, and its functional token sequences explain the logic behind each answer;
  • Baseline comparison: Reasoning latency is an order of magnitude lower than that of purely agentic methods, while generalization and training stability are better than those of purely implicit methods.
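
The interpretability claim rests on functional tokens being readable in the generated sequence. As a minimal sketch, assuming hypothetical token names `<rotate>` and `<zoom>` (modeled on the rotation and zooming operations mentioned earlier) and a made-up output, a reasoning trace could be recovered by filtering the generated tokens:

```python
# Hypothetical functional tokens and descriptions (illustrative only).
FUNCTIONAL_TOKEN_MEANINGS = {
    "<rotate>": "rotate the visual state",
    "<zoom>": "zoom into a region",
}


def functional_trace(tokens):
    """Return (position, token, description) for each functional token, in order."""
    return [
        (pos, tok, FUNCTIONAL_TOKEN_MEANINGS[tok])
        for pos, tok in enumerate(tokens)
        if tok in FUNCTIONAL_TOKEN_MEANINGS
    ]


# Made-up model output mixing ordinary and functional tokens.
generated = ["The", "shape", "<zoom>", "looks", "tilted", "<rotate>",
             "so", "the", "answer", "is", "B"]
for pos, tok, meaning in functional_trace(generated):
    print(f"step at position {pos}: {tok} ({meaning})")
```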

Section 06

Technical Significance and Future Directions: Discrete Tokens Connecting Symbolic and Neural Reasoning

The significance of ATLAS is that it shows discrete tokens can bridge symbolic reasoning and neural computation, unifying agentic reasoning (symbolic, interpretable) with neural reasoning (continuous, efficient). Possible future directions include:

  1. Internalization of tool learning: Fold commonly used tool functions into functional tokens;
  2. Unified multi-modal representation: Use functional tokens as an operation interface across modalities;
  3. Enhanced interpretability: Discrete tokens make the reasoning process transparent, which suits high-risk scenarios.