# ParaVT: A Multi-Agent Parallel Video Tool Calling Framework to Resolve the Tool Prior Paradox

> ParaVT is the first end-to-end RL-trained multi-agent parallel video tool calling framework. It addresses the error propagation and context contamination issues of serial calling by invoking multiple time window cropping tools in a single call. The PARA-GRPO algorithm is proposed to resolve the tool prior paradox, achieving an average improvement of 7.9% across 6 long video understanding benchmarks.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-19T18:01:26.000Z
- Last activity: 2026-05-21T02:51:40.438Z
- Popularity: 118.2
- Keywords: multimodal models, reinforcement learning, video understanding, tool calling, multi-agent, GRPO, long video, agents
- Page link: https://www.zingnex.cn/en/forum/thread/paravt
- Canonical: https://www.zingnex.cn/forum/thread/paravt
- Markdown source: floors_fallback

---

## ParaVT Framework Guide: A Multi-Agent Parallel Video Tool Calling Solution to Resolve the Tool Prior Paradox

ParaVT is the first end-to-end RL-trained multi-agent parallel video tool calling framework. Its core innovation lies in invoking multiple time window cropping tools simultaneously in a single dialogue turn, addressing the error propagation, context contamination, and inference cost issues of serial calling. The framework proposes the PARA-GRPO algorithm to tackle the tool prior paradox, achieving an average performance improvement of 7.9% across 6 long video understanding benchmarks.

## Tool Calling Challenges in Long Video Understanding

Large multimodal models (LMMs) cannot fit long videos into their context window, so they rely on tool calls to extend their perception. Most existing RL-based tool calling methods work serially, which has three drawbacks: a single wrong crop propagates and cannot be corrected, multi-turn calls contaminate the context, and inference cost grows linearly with the number of turns.

## ParaVT Framework Design: Multi-Agent Parallelism and End-to-End RL Training

ParaVT adopts a multi-agent architecture: the main model generates multiple cropping instructions, and sub-agents process the corresponding segment features in parallel before the results are aggregated. This compresses a multi-turn task into a single turn and reduces latency. The framework is trained end to end with RL, so the model learns its cropping strategy through interaction with the environment rather than by imitating manual annotations.
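The single-turn fan-out described above can be illustrated with a minimal sketch. It is not the ParaVT implementation: the function names (`crop_frames`, `sub_agent`, `parallel_tool_call`), the `(start_s, end_s)` window format, and the use of a thread pool are all assumptions made for illustration, and the "sub-agent" here only counts frames instead of running a model.

```python
from concurrent.futures import ThreadPoolExecutor


def crop_frames(num_frames, fps, window):
    """Return the frame indices inside a (start_s, end_s) window, clipped to the video."""
    start_s, end_s = window
    first = max(0, int(start_s * fps))
    last = min(num_frames, int(end_s * fps))
    return list(range(first, last))


def sub_agent(num_frames, fps, window):
    """Hypothetical sub-agent: a real one would run a model on the cropped segment."""
    frames = crop_frames(num_frames, fps, window)
    return {"window": window, "n_frames": len(frames)}


def parallel_tool_call(num_frames, fps, windows):
    """Dispatch every window in one turn and aggregate results in request order."""
    with ThreadPoolExecutor(max_workers=max(1, len(windows))) as pool:
        # pool.map yields results in the same order as the input windows
        return list(pool.map(lambda w: sub_agent(num_frames, fps, w), windows))


# Three windows requested by the (hypothetical) main model in a single turn
results = parallel_tool_call(
    num_frames=3000, fps=1.0, windows=[(10, 20), (100, 130), (2990, 3010)]
)
```

Because all windows are issued at once, a poor crop in one window does not shape the next request, which is the property the serial mode lacks.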

## Discovery and Verification of the Tool Prior Paradox

When standard RL was applied to ParaVT, the tool prior paradox emerged: the tool prior formed during LMM pre-training leads to format collapse (unparseable output) and tool-skipping shortcuts (guessing the answer directly). Cross-model verification shows that weak-prior models keep a stable format but fail to trigger tool calls, which confirms that the prior is both a necessary condition for tool use and a threat to training.
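To make the two failure modes concrete, the sketch below classifies a rollout as well-formed, collapsed, or tool-skipping. The source does not specify ParaVT's output format, so the `<tool_call>` tag, the JSON payload with `start` and `end` fields, and the function name `classify_rollout` are hypothetical stand-ins.

```python
import json
import re

# Hypothetical wire format: <tool_call>{"start": 5, "end": 12}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def classify_rollout(text):
    """Label a rollout as 'ok', 'format_collapse', or 'tool_skipped'."""
    blocks = TOOL_CALL_RE.findall(text)
    if not blocks:
        # An opening tag without a closed block is treated as unparseable output
        return "format_collapse" if "<tool_call>" in text else "tool_skipped"
    for block in blocks:
        try:
            call = json.loads(block)
            start, end = float(call["start"]), float(call["end"])
        except (ValueError, KeyError, TypeError):
            return "format_collapse"
        if not start < end:
            return "format_collapse"
    return "ok"
```

Counting the share of rollouts labeled `ok` under such a check is one plausible way to track a format compliance rate during training.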

## PARA-GRPO Algorithm: Key Mechanism to Resolve the Tool Prior Paradox

PARA-GRPO adds two mechanisms to standard GRPO:

1. Targeted format reward: rewards are applied only at the structured token positions that are prone to collapse, which stabilizes the format while preserving exploration space.
2. Frame budget randomization: the frame budget is varied at random, which forces the model to call tools instead of relying on shortcuts.
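The two mechanisms above can be sketched next to the usual GRPO group-relative advantage. This is a sketch under stated assumptions, not the PARA-GRPO formulation: the source gives no equations, so the bonus value, the candidate budgets, the set of structural tokens, and all function names are hypothetical.

```python
import random
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward against its rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def targeted_format_bonus(tokens, structural_tokens, well_formed, bonus=0.5):
    """Per-token bonus that is non-zero only at structural positions (assumed scheme)."""
    if not well_formed:
        return [0.0] * len(tokens)
    return [bonus if tok in structural_tokens else 0.0 for tok in tokens]


def sample_frame_budget(rng, budgets=(8, 16, 32, 64)):
    """Draw a frame budget per rollout so a fixed budget cannot be exploited."""
    return rng.choice(budgets)


# Hypothetical rollout group: two correct answers, two wrong ones
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# Hypothetical token sequence; only the tag tokens count as structural
STRUCTURAL = {"<tool_call>", "</tool_call>"}
tokens = ["<tool_call>", "start", "=", "5", "</tool_call>"]
bonuses = targeted_format_bonus(tokens, STRUCTURAL, well_formed=True)

budget = sample_frame_budget(random.Random(0))
```

Restricting the bonus to structural positions leaves the content tokens (which windows to crop) governed by the task reward alone, which is the sense in which exploration space is preserved.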

## Experimental Results: Significant Improvements in Performance and Efficiency

ParaVT achieves an average improvement of 7.9% across 6 long video understanding benchmarks, which cover action recognition, temporal localization, and video question answering. PARA-GRPO raises the format compliance rate from 0.13 to 0.64. Parallel calling compresses multi-turn interactions into a single turn, reducing inference latency roughly in proportion to the number of turns that serial calling would need.
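The latency claim follows from a simple cost model, sketched below. The timings are made-up illustrative values, not measurements from the ParaVT experiments, and the model assumes sub-agents run fully concurrently with a fixed aggregation overhead.

```python
def serial_latency(per_turn_s, n_turns):
    """Serial calling: each turn waits for the previous one to finish."""
    return per_turn_s * n_turns


def parallel_latency(per_window_s, overhead_s=0.0):
    """Parallel calling: one turn, bounded by the slowest window plus overhead."""
    return max(per_window_s) + overhead_s


# Illustrative numbers only: 4 crops at about 2 s each
serial = serial_latency(per_turn_s=2.0, n_turns=4)
parallel = parallel_latency([2.0, 1.5, 2.5, 1.0], overhead_s=0.5)
```

Under these assumptions the serial cost scales with the number of turns while the parallel cost stays near that of a single turn.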

## Research Insights and Future Directions

ParaVT suggests that RL training should work with pre-trained priors rather than against them. Its limitations are that it supports only video cropping tools and that the system is complex. Future directions include extending the approach to more complex tool chains, exploring the tool prior paradox in other domains, and developing general prior-aware RL algorithms.
