Zing Forum

Reading

Metis: Teaching Multimodal Agents to "Think Twice Before Acting" — HDPO Framework Solves Tool Overuse Problem

The research team from the Chinese University of Hong Kong proposes the HDPO framework, which addresses the over-reliance of agents on external tools by decoupling the reward mechanism. Experiments show that the Metis model reduces tool call frequency by several orders of magnitude while maintaining high accuracy, opening a new path for efficiency optimization of multimodal agents.

Tags: multimodal agents · tool-use optimization · reinforcement learning · metacognition · HDPO · Metis · AI efficiency · policy optimization
Published 2026-04-12 17:43 · Recent activity 2026-04-12 18:20 · Estimated read 5 min

Section 01

Metis & HDPO Framework: A Breakthrough in Multimodal Agent Tool Efficiency

The research team from the Chinese University of Hong Kong proposes the HDPO (Hierarchical Decoupled Policy Optimization) framework to address the tool-overuse problem in multimodal agents. The Metis model trained with HDPO maintains high accuracy while reducing tool calls by several orders of magnitude, opening a new path for efficiency optimization of multimodal agents. The goal of this work is to teach agents to "think twice before acting" and develop metacognitive abilities.


Section 02

Background: The "Tool Dependency" Plague in Multimodal Agents

Multimodal agents with visual understanding can actively call external tools (search engines, calculators, APIs), but they often overuse them, even for problems solvable from visual information alone. This incurs two costs: frequent API calls add latency, and redundant external information becomes noise in the context. For example, given a simple arithmetic question printed in an image, an agent might call an OCR tool and then a calculator instead of reading and computing directly.
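A toy comparison of the two trajectories makes the cost concrete; the latency figures below are hypothetical, chosen only to illustrate the overhead of a redundant tool pipeline:

```python
# Hypothetical per-call latencies (milliseconds) for illustration only.
OCR_LATENCY_MS = 400
CALC_LATENCY_MS = 150

def pipeline_cost(tool_latencies_ms):
    """Return (total added latency, number of tool calls) for a trajectory."""
    return sum(tool_latencies_ms), len(tool_latencies_ms)

# Redundant path: OCR the image, then call a calculator.
tool_path = pipeline_cost([OCR_LATENCY_MS, CALC_LATENCY_MS])    # (550, 2)

# Direct path: the agent reads and computes from the image itself.
direct_path = pipeline_cost([])                                 # (0, 0)
```

The direct path adds zero external latency and zero API calls, which is exactly the behavior HDPO tries to elicit when tools are unnecessary.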


Section 03

Root Cause: Limitations of Traditional RL Penalty Mechanisms

Existing RL solutions impose a scalar penalty on each tool call, but this faces a dilemma: if the penalty is too strong, agents avoid tools even when they are needed and tasks fail; if it is too weak, the efficiency signal is drowned out by the variance of the accuracy reward. Because the traditional coupled reward mixes the two objectives into one scalar, accuracy and efficiency compete and are hard to optimize simultaneously.
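A minimal numeric sketch of this dilemma; the penalty weights and reward values are illustrative, not taken from the paper:

```python
def coupled_reward(correct: bool, tool_calls: int, penalty: float) -> float:
    """Traditional coupled reward: one scalar mixes accuracy and tool cost."""
    return (1.0 if correct else 0.0) - penalty * tool_calls

# With a weak penalty, two correct trajectories differ by only 0.15,
# a gap easily drowned out by the 0-vs-1 swing of the accuracy term.
efficient = coupled_reward(correct=True, tool_calls=0, penalty=0.05)  # 1.0
wasteful = coupled_reward(correct=True, tool_calls=3, penalty=0.05)   # 0.85

# With a strong penalty, a correct trajectory that genuinely needed
# two tool calls scores worse than doing nothing at all.
necessary = coupled_reward(correct=True, tool_calls=2, penalty=0.6)   # -0.2
```

No single penalty weight resolves both failure modes, which is the motivation for decoupling the objectives instead of tuning the scalar.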


Section 04

HDPO Framework: Decoupling Accuracy and Efficiency Goals

The HDPO framework decouples the two goals into orthogonal channels: 1) an accuracy channel that maximizes task correctness without considering tool cost; 2) an efficiency channel that optimizes tool use only on accurate trajectories, via conditional advantage estimation. This "learn to walk, then run" approach first builds task-solving ability and then optimizes efficiency, mirroring human metacognition.
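The two channels can be sketched as separate reward functions. The `Trajectory` record and the linear efficiency reward below are illustrative assumptions for the sketch, not the paper's exact formulation:

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    correct: bool     # did the agent solve the task?
    tool_calls: int   # how many external tools it invoked

def accuracy_reward(traj: Trajectory) -> float:
    """Channel 1: pure task correctness, blind to tool cost."""
    return 1.0 if traj.correct else 0.0

def efficiency_reward(traj: Trajectory, max_calls: int = 10) -> float:
    """Channel 2: rewards frugal tool use, but ONLY on correct
    trajectories; incorrect ones get no efficiency signal at all."""
    if not traj.correct:
        return 0.0
    return 1.0 - min(traj.tool_calls, max_calls) / max_calls
```

Because the efficiency channel never fires on failed trajectories, cutting tool calls can never be learned as a substitute for solving the task.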


Section 05

Technical Core: Conditional Advantage Estimation

Unlike traditional advantage estimation, conditional advantage estimation is computed only over successful trajectories. It compares tool efficiency among accurate paths: trajectories that reach the correct answer with fewer tool calls receive a positive efficiency signal, ensuring that efficiency gains never come at the expense of accuracy.
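A toy sketch of the idea, assuming each trajectory records correctness and a tool-call count; the mean-baseline form here is an illustrative assumption, not the paper's exact estimator:

```python
from collections import namedtuple

# Hypothetical trajectory record for the sketch: (correct, tool_calls).
Traj = namedtuple("Traj", ["correct", "tool_calls"])

def conditional_advantages(trajs):
    """Efficiency advantages computed only over correct trajectories.
    Failed trajectories receive no efficiency signal; among successes,
    fewer tool calls than the group mean yields a positive advantage."""
    successes = [t for t in trajs if t.correct]
    if not successes:
        return [0.0] * len(trajs)
    mean_calls = sum(t.tool_calls for t in successes) / len(successes)
    return [(mean_calls - t.tool_calls) if t.correct else 0.0
            for t in trajs]

batch = [Traj(True, 1), Traj(True, 5), Traj(False, 0)]
# Mean tool calls over the two successes is 3.0,
# so the advantages are [2.0, -2.0, 0.0].
print(conditional_advantages(batch))
```

Note that the failed trajectory contributes nothing to the baseline and gets zero advantage, so the policy is never pushed toward "fail cheaply".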


Section 06

Experimental Evidence: Metis Delivers Remarkable Results

The Metis model trained with HDPO was evaluated on multiple multimodal benchmarks. It maintained or improved accuracy while reducing tool calls by several orders of magnitude, showing that agents can develop a metacognitive ability: judging when to use tools and when to solve a problem independently.


Section 07

Practical Impact: Cost, Speed, and Stability Improvements

For enterprises, HDPO reduces API costs (a major operational expense), shortens response time, and improves system stability by reducing reliance on external services. The work shows that efficiency and capability need not conflict: "smart and thrifty" agents are the more practical ones.


Section 08

Open Source & Future Directions

The Metis code is open-sourced on GitHub. Future directions include extending HDPO to other metacognitive abilities, such as time management and planning, toward general AI systems capable of self-monitoring and self-regulation.