Zing Forum

π-Bench: Evaluating the Performance of Proactive Personal Assistant Agents in Long-Range Workflows

π-Bench is a benchmark for evaluating proactive personal assistant agents in long-range workflows. It comprises 100 multi-turn tasks across 5 domain-specific roles and scores agents along two dimensions: proactivity and completeness.

Tags: Agent Evaluation, Proactive AI, Long-Range Workflows, Benchmarking, Personal Assistants, Large Language Models, AI Benchmark
Published 2026-05-30 01:15 | Recent activity 2026-05-30 01:19 | Estimated read 7 min

Background: Why Do We Need to Evaluate Proactive Agents?

Current large language models (LLMs) and agent systems mostly focus on short-term task execution: answering a single question, generating code once, or handling a single conversation. Real-world personal assistants, however, must handle long-range workflows that span hours or even days. These workflows often start from vague requirements, and important needs emerge only gradually as the interaction deepens.

Traditional benchmarks focus mainly on three areas: short-term task execution, GUI and mobile-device interaction, and pure memory retrieval. None of these measures whether an agent is proactive, that is, whether it can infer hidden intentions and act preemptively before the user states a need explicitly. π-Bench was created to fill this gap.

Core Design of π-Bench

π-Bench (pronounced Pi-Bench) is a benchmark built specifically for proactive personal assistant agents. Its design has three key features:

First, it includes 100 multi-turn tasks distributed across 5 domain-specific role scenarios: researcher, marketer, pharmacist, law trainee, and financier. These roles represent real-world professional scenarios that require complex workflow management.

Second, the tasks are organized as multi-session episodes in a persistent workspace. The agent must therefore track the state of the work across sessions and handle dependencies between tasks.

Most importantly, π-Bench introduces the concept of "hidden intent". The user's initial request is often incomplete, and important requirements emerge only gradually during the interaction. The agent must either infer these hidden intentions or surface them through targeted clarifying questions.
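To make the episode structure concrete, here is a minimal sketch of how a multi-session episode with a persistent workspace and hidden intents might be represented. The class and field names (`Episode`, `HiddenIntent`, `surfaces_in_session`, `pending_intents`) are hypothetical and chosen only for illustration; the post does not describe π-Bench's actual data format.

```python
from dataclasses import dataclass, field


@dataclass
class HiddenIntent:
    description: str          # requirement the user has not stated yet
    surfaces_in_session: int  # session index at which the user would voice it


@dataclass
class Episode:
    role: str            # e.g. "pharmacist" (hypothetical field name)
    requests: list[str]  # one user request per session, in order
    hidden_intents: list[HiddenIntent] = field(default_factory=list)
    # Persistent workspace: path -> content, kept across sessions.
    workspace: dict[str, str] = field(default_factory=dict)

    def pending_intents(self, session_index: int) -> list[HiddenIntent]:
        """Hidden intents the user has not yet voiced at the given session."""
        return [h for h in self.hidden_intents
                if h.surfaces_in_session > session_index]


episode = Episode(
    role="pharmacist",
    requests=["Draft a stock report.", "Please also flag items near expiry."],
    hidden_intents=[HiddenIntent("flag items near expiry", surfaces_in_session=1)],
)
# Written in session 0 and still available in session 1.
episode.workspace["stock_report.md"] = "draft v1"

pending_now = episode.pending_intents(0)    # the expiry requirement is still unstated
pending_later = episode.pending_intents(1)  # by session 1 the user has voiced it
```

In this toy model, a proactive agent would try to address everything in `pending_now` during session 0 instead of waiting for the user to raise it in session 1.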

Evaluation Dimensions: Proactivity and Completeness

π-Bench evaluates agent performance along two core dimensions:

Proactivity (PROC)

Proactivity measures whether the agent resolves hidden intentions early. It covers two abilities: first, identifying the user's unexpressed needs through reasoning; second, prompting the user to clarify requirements through focused questions when necessary. A highly proactive agent reduces the user's burden, because the user does not have to keep supplying missing information in later interactions.
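As a rough illustration of "resolving hidden intentions early", the sketch below scores an agent by the fraction of hidden intents it resolved before the user would have stated them. This is an assumption made for the example, not π-Bench's published formula, which the post does not give; the function name and the session-index representation are hypothetical.

```python
from typing import Optional


def proactivity_score(surfaces_at: dict[str, int],
                      resolved_at: dict[str, Optional[int]]) -> float:
    """Fraction of hidden intents resolved before the user voiced them.

    surfaces_at: intent -> session index where the user would state it.
    resolved_at: intent -> session index where the agent resolved it
                 (by inference or a clarifying question), or None if never.
    """
    if not surfaces_at:
        return 0.0
    early = 0
    for intent, surface_session in surfaces_at.items():
        resolved_session = resolved_at.get(intent)
        # Count only intents handled strictly before the user raised them.
        if resolved_session is not None and resolved_session < surface_session:
            early += 1
    return early / len(surfaces_at)


surfaces_at = {"expiry dates": 2, "budget cap": 3, "citation style": 1}
resolved_at = {"expiry dates": 1, "budget cap": 3, "citation style": None}
score = proactivity_score(surfaces_at, resolved_at)  # 1 of 3 intents resolved early
```

Here only "expiry dates" counts as proactive: "budget cap" was handled in the same session the user raised it, and "citation style" was never resolved.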

Completeness (COMP)

Completeness measures whether the final deliverable meets all checklist requirements and artifact-level obligations. An agent that shows high proactivity still cannot score well if its final deliverable is incomplete. This dimension ensures that the agent not only "thinks ahead" but also "follows through".

Scoring combines rubric-based judgment of hidden-intent resolution with checklist validation. An audit found high agreement among judges (disagreement rate below 4%), which supports the reliability of the evaluation results.
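The checklist and audit ideas can be sketched as follows, assuming completeness is the share of checklist items satisfied and the disagreement rate is the share of items on which two judges give different verdicts. Both definitions and both function names are assumptions for illustration; the post does not state how π-Bench computes either quantity or how many judges took part.

```python
def completeness_score(checklist: dict[str, bool]) -> float:
    """Share of checklist items satisfied by the final deliverable."""
    if not checklist:
        return 0.0
    return sum(checklist.values()) / len(checklist)


def disagreement_rate(judge_a: list[bool], judge_b: list[bool]) -> float:
    """Share of items on which two judges give different verdicts."""
    if len(judge_a) != len(judge_b):
        raise ValueError("both judges must rate the same items")
    if not judge_a:
        return 0.0
    return sum(a != b for a, b in zip(judge_a, judge_b)) / len(judge_a)


# Hypothetical checklist for one deliverable: 3 of 4 items satisfied.
checklist = {
    "summary included": True,
    "sources cited": True,
    "expiry dates flagged": True,
    "file saved to workspace": False,
}
comp = completeness_score(checklist)

# Made-up verdicts: the judges differ on 1 of 50 items.
judge_a = [True] * 50
judge_b = [True] * 49 + [False]
rate = disagreement_rate(judge_a, judge_b)
```

With these made-up inputs the deliverable satisfies three quarters of its checklist, and the two judges disagree on 2% of items, which would fall under the 4% threshold mentioned above.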

Performance of Current Mainstream Models

The π-Bench team tested several mainstream large language models. The main findings are as follows.

On average, GPT-5.4 leads in proactivity (67.0%), while Claude Opus 4.6 performs best in completeness (67.6%). This suggests that models trade off proactive inference against complete execution.

By role, each model's performance varies significantly across domains. For example, Claude Opus 4.6 stands out in the law trainee scenario (completeness: 74.5%), while GPT-5.4 is more proactive in the marketing and finance scenarios. Kimi K2.5 has a lower average proactivity (43.1%), yet its completeness in the pharmacist scenario reaches 74.8%, which points to domain-specific strengths in model capabilities.

Notably, all models show relatively low proactivity in the researcher scenario (29% to 50%), which may reflect the complexity and vagueness of academic research workflows.