The π-Bench team evaluated several mainstream large language models. The main findings are:
On average, GPT-5.4 leads in proactivity (67.0%), while Claude Opus 4.6 leads in completeness (67.6%). Since no single model leads on both metrics, this suggests a trade-off between proactive inference and complete execution.
Broken down by role, each model's performance varies significantly across domains. For example, Claude Opus 4.6 stands out in the law trainee scenario (completeness: 74.5%), while GPT-5.4 is more proactive in the marketing and finance scenarios. Kimi K2.5 has a lower average proactivity (43.1%), yet its completeness in the pharmacist scenario reaches 74.8%, which points to domain-specific strengths in model capabilities.
Notably, all models show relatively low proactivity in the researcher scenario (29% to 50%), which may reflect the complexity and ambiguity of academic research workflows.