Zing Forum

Reading

Metis: Teaching Multimodal Agents to "Think Twice Before Acting" — HDPO Framework Solves Tool Overuse Problem

The research team from the Chinese University of Hong Kong proposes the HDPO framework, which addresses the over-reliance of agents on external tools by decoupling the reward mechanism. Experiments show that the Metis model reduces tool call frequency by several orders of magnitude while maintaining high accuracy, opening a new path for efficiency optimization of multimodal agents.

Tags: multimodal agents · tool-use optimization · reinforcement learning · metacognition · HDPO · Metis · AI efficiency · policy optimization
Published 2026-04-12 17:43 · Recent activity 2026-04-12 18:20 · Estimated read 5 min

Section 01

Metis & HDPO Framework: A Breakthrough in Multimodal Agent Tool Efficiency

The research team from the Chinese University of Hong Kong proposes the HDPO (Hierarchical Decoupled Policy Optimization) framework to address the tool-overuse problem in multimodal agents. The Metis model trained with HDPO maintains high accuracy while reducing tool calls by several orders of magnitude, opening a new path for efficiency optimization of multimodal agents. The goal of this work is to teach agents to "think twice before acting" and develop metacognitive abilities.


Section 02

Background: The "Tool Dependency" Plague in Multimodal Agents

Multimodal agents with visual understanding can actively call external tools (search engines, calculators, APIs), but they often overuse them, even for problems solvable from visual information alone. This incurs two costs: frequent API calls add latency, and redundant external information becomes noise in the context. For example, given a simple arithmetic question printed in an image, an agent might call an OCR tool and then a calculator instead of reading and computing directly.
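A toy comparison of the two trajectories makes the cost concrete; the latency figures below are hypothetical, chosen only to illustrate the overhead of a redundant tool pipeline:

```python
# Hypothetical per-call latencies (milliseconds) for illustration only.
OCR_LATENCY_MS = 400
CALC_LATENCY_MS = 150

def pipeline_cost(tool_latencies_ms):
    """Return (total added latency, number of tool calls) for a trajectory."""
    return sum(tool_latencies_ms), len(tool_latencies_ms)

# Redundant path: OCR the image, then call a calculator.
tool_path = pipeline_cost([OCR_LATENCY_MS, CALC_LATENCY_MS])    # (550, 2)

# Direct path: the agent reads and computes from the image itself.
direct_path = pipeline_cost([])                                 # (0, 0)
```

The direct path adds zero external latency and zero API calls, which is exactly the behavior HDPO tries to elicit when tools are unnecessary.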


Section 03

Root Cause: Limitations of Traditional RL Penalty Mechanisms

Existing RL solutions impose a scalar penalty on each tool call, but this faces a dilemma: if the penalty is too strong, agents avoid tools even when they are needed and tasks fail; if it is too weak, the efficiency signal is drowned out by the variance of the accuracy reward. Because the traditional coupled reward mixes the two objectives into one scalar, accuracy and efficiency compete and are hard to optimize simultaneously.
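A minimal numeric sketch of this dilemma; the penalty weights and reward values are illustrative, not taken from the paper:

```python
def coupled_reward(correct: bool, tool_calls: int, penalty: float) -> float:
    """Traditional coupled reward: one scalar mixes accuracy and tool cost."""
    return (1.0 if correct else 0.0) - penalty * tool_calls

# With a weak penalty, two correct trajectories differ by only 0.15,
# a gap easily drowned out by the 0-vs-1 swing of the accuracy term.
efficient = coupled_reward(correct=True, tool_calls=0, penalty=0.05)  # 1.0
wasteful = coupled_reward(correct=True, tool_calls=3, penalty=0.05)   # 0.85

# With a strong penalty, a correct trajectory that genuinely needed
# two tool calls scores worse than doing nothing at all.
necessary = coupled_reward(correct=True, tool_calls=2, penalty=0.6)   # -0.2
```

No single penalty weight resolves both failure modes, which is the motivation for decoupling the objectives instead of tuning the scalar.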


Section 04

HDPO Framework: Decoupling Accuracy and Efficiency Goals

The HDPO framework decouples the two goals into orthogonal channels: 1) an accuracy channel that maximizes task correctness without considering tool cost; 2) an efficiency channel that optimizes tool use only on accurate trajectories, via conditional advantage estimation. This "learn to walk, then run" approach first builds task-solving ability and then optimizes efficiency, mirroring human metacognition.
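The two channels can be sketched as separate reward functions. The `Trajectory` record and the linear efficiency reward below are illustrative assumptions for the sketch, not the paper's exact formulation:

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    correct: bool     # did the agent solve the task?
    tool_calls: int   # how many external tools it invoked

def accuracy_reward(traj: Trajectory) -> float:
    """Channel 1: pure task correctness, blind to tool cost."""
    return 1.0 if traj.correct else 0.0

def efficiency_reward(traj: Trajectory, max_calls: int = 10) -> float:
    """Channel 2: rewards frugal tool use, but ONLY on correct
    trajectories; incorrect ones get no efficiency signal at all."""
    if not traj.correct:
        return 0.0
    return 1.0 - min(traj.tool_calls, max_calls) / max_calls
```

Because the efficiency channel never fires on failed trajectories, cutting tool calls can never be learned as a substitute for solving the task.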


Section 05

Technical Core: Conditional Advantage Estimation

Unlike traditional advantage estimation, conditional advantage estimation is computed only over successful trajectories. It compares tool efficiency among accurate paths: trajectories that reach the correct answer with fewer tool calls receive a positive efficiency signal, ensuring that efficiency gains never come at the expense of accuracy.
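A toy sketch of the idea, assuming each trajectory records correctness and a tool-call count; the mean-baseline form here is an illustrative assumption, not the paper's exact estimator:

```python
from collections import namedtuple

# Hypothetical trajectory record for the sketch: (correct, tool_calls).
Traj = namedtuple("Traj", ["correct", "tool_calls"])

def conditional_advantages(trajs):
    """Efficiency advantages computed only over correct trajectories.
    Failed trajectories receive no efficiency signal; among successes,
    fewer tool calls than the group mean yields a positive advantage."""
    successes = [t for t in trajs if t.correct]
    if not successes:
        return [0.0] * len(trajs)
    mean_calls = sum(t.tool_calls for t in successes) / len(successes)
    return [(mean_calls - t.tool_calls) if t.correct else 0.0
            for t in trajs]

batch = [Traj(True, 1), Traj(True, 5), Traj(False, 0)]
# Mean tool calls over the two successes is 3.0,
# so the advantages are [2.0, -2.0, 0.0].
print(conditional_advantages(batch))
```

Note that the failed trajectory contributes nothing to the baseline and gets zero advantage, so the policy is never pushed toward "fail cheaply".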


Section 06

Experimental Evidence: Metis Delivers Remarkable Results

The Metis model trained with HDPO was evaluated on multiple multimodal benchmarks. It maintained or improved accuracy while reducing tool calls by several orders of magnitude, showing that agents can develop a metacognitive ability: judging when to use tools and when to solve a problem independently.


Section 07

Practical Impact: Cost, Speed, and Stability Improvements

For enterprises, HDPO reduces API costs (a major operational expense), shortens response time, and improves system stability by reducing reliance on external services. The work shows that efficiency and capability need not conflict: "smart and thrifty" agents are the more practical ones.


Section 08

Open Source & Future Directions

The Metis code is open-sourced on GitHub. Future directions include extending HDPO to other metacognitive abilities, such as time management and planning, toward general AI systems capable of self-monitoring and self-regulation.