Zing Forum


Structured Testing of Multi-Agent Workflows: From End-to-End Success Rate to Coverage Validation

Existing evaluations rely on end-to-end task success rates, making it difficult to verify whether the claimed coordination structure is actually triggered. The new study proposes structural coverage criteria, generating executable tests for 403 structural obligations via typed coordination graphs and DSPy scenario generation.

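Tags: multi-agent testing, structural coverage, OpenAI Agents SDK, DSPy, adversarial testing, workflow validation, end-to-end testing, coverage obligations, agent systems
Published 2026-05-26 12:07 | Recent activity 2026-05-27 14:27 | Estimated read 7 min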

Section 01

[Introduction] Structured Testing of Multi-Agent Workflows: A Breakthrough from End-to-End to Structural Coverage

Core Insight: Existing multi-agent system evaluations rely on end-to-end task success rates, which cannot verify whether the claimed coordination structure is actually triggered. A study published on arXiv in May 2026 proposes a structural coverage testing method. Using typed coordination graphs, coverage obligation derivation, and DSPy scenario generation, it generates executable tests for 403 structural obligations. The method compensates for the gaps in end-to-end testing and reveals structural defects such as zombie agents and ghost tools.


Section 02

Testing Dilemma of Multi-Agent Systems: Blind Spots of End-to-End Testing

As the complexity of LLM multi-agent systems increases, workflows include multiple roles, tool sets, access rules, constraints, and delegation paths. However, existing tests only focus on end-to-end results, leading to blind spots:

  • An agent is never invoked
  • Some tool access rules are not verified
  • Constraints never take effect
  • Delegation paths exist but are not used

This is like software testing that only looks at output without checking code branches, which easily misses structural defects.
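The blind spots listed above can be made concrete by comparing what a workflow declares with what a run actually exercised. The following is a minimal sketch, not the study's tooling: the `Event` record, the `declared` dictionary, and the agent and tool names (`triage`, `billing`, `refunds`, `lookup_invoice`, `issue_refund`) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One entry of a hypothetical execution trace."""
    kind: str          # "agent", "tool", or "delegation"
    source: str        # the agent that ran, called the tool, or delegated
    target: str = ""   # tool name or delegate agent (empty for "agent" events)


# Hypothetical declared structure of a small workflow.
declared = {
    "agents": {"triage", "billing", "refunds"},
    "tools": {("billing", "lookup_invoice"), ("refunds", "issue_refund")},
    "delegations": {("triage", "billing"), ("triage", "refunds")},
}

# A run that succeeds end to end but only touches the billing path.
trace = [
    Event("agent", "triage"),
    Event("delegation", "triage", "billing"),
    Event("agent", "billing"),
    Event("tool", "billing", "lookup_invoice"),
]


def unexercised(declared, trace):
    """Return declared structural elements that never appear in the trace."""
    seen_agents = {e.source for e in trace if e.kind == "agent"}
    seen_tools = {(e.source, e.target) for e in trace if e.kind == "tool"}
    seen_delegations = {(e.source, e.target) for e in trace if e.kind == "delegation"}
    return {
        "agents": declared["agents"] - seen_agents,
        "tools": declared["tools"] - seen_tools,
        "delegations": declared["delegations"] - seen_delegations,
    }


gaps = unexercised(declared, trace)
for category, missing in gaps.items():
    print(category, sorted(missing))
```

In this example the task would count as an end-to-end success, yet the `refunds` agent, its tool, and the delegation path leading to it are never exercised.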

Section 03

Structured Testing Method: Typed Coordination Graphs and Coverage Obligation Derivation

Core Steps of Structured Testing:

  1. Typed Coordination Graph: Nodes are agents; edges represent tool calls, restricted calls, and delegation relationships, each labeled with its interaction type.
  2. Coverage Obligation Derivation: From the graph, derive what must be verified: that each agent is triggered, that allowed tools are called, that restricted tools withstand adversarial testing, and that delegation paths are executed.
  3. DSPy Scenario Generation: Convert obligations into natural language scenarios for runtime verification.

Innovation: Adversarial testing of restricted tools, which proactively attempts forbidden calls to verify that the restriction mechanisms hold (violations were triggered in 23/248 cases across the 10 SDK benchmarks).
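The steps above can be sketched in a few lines of Python. This is an illustrative reading of the method under stated assumptions, not the study's implementation: the `EdgeType`, `Edge`, and `Obligation` types, the obligation names, and the example agents and tools are hypothetical, and a plain string template stands in for the DSPy scenario generation step (it does not use DSPy's API).

```python
from dataclasses import dataclass
from enum import Enum


class EdgeType(Enum):
    ALLOWED_TOOL = "allowed_tool"
    RESTRICTED_TOOL = "restricted_tool"
    DELEGATION = "delegation"


@dataclass(frozen=True)
class Edge:
    """A typed edge of the coordination graph (hypothetical representation)."""
    source: str      # agent name
    target: str      # tool name or delegate agent name
    type: EdgeType


@dataclass(frozen=True)
class Obligation:
    kind: str
    subject: tuple
    adversarial: bool = False


def derive_obligations(agents, edges):
    """One obligation per agent and one per typed edge."""
    obligations = [Obligation("agent_triggered", (a,)) for a in sorted(agents)]
    for e in edges:
        if e.type is EdgeType.ALLOWED_TOOL:
            obligations.append(Obligation("tool_called", (e.source, e.target)))
        elif e.type is EdgeType.RESTRICTED_TOOL:
            # Restricted edges are tested adversarially: try to provoke the call.
            obligations.append(
                Obligation("restriction_holds", (e.source, e.target), adversarial=True)
            )
        elif e.type is EdgeType.DELEGATION:
            obligations.append(Obligation("delegation_taken", (e.source, e.target)))
    return obligations


def scenario_request(ob):
    """Stand-in for scenario generation: a prompt a generator model could receive."""
    if ob.adversarial:
        agent, tool = ob.subject
        return f"Write a user request that pressures agent '{agent}' to call forbidden tool '{tool}'."
    return f"Write a user request that should exercise {ob.kind} for {' -> '.join(ob.subject)}."


agents = {"triage", "billing"}
edges = [
    Edge("billing", "lookup_invoice", EdgeType.ALLOWED_TOOL),
    Edge("triage", "issue_refund", EdgeType.RESTRICTED_TOOL),
    Edge("triage", "billing", EdgeType.DELEGATION),
]

obligations = derive_obligations(agents, edges)
for ob in obligations:
    print(scenario_request(ob))
```

Keeping the obligation list separate from scenario generation means the same obligations can be re-checked against runtime traces regardless of how the scenarios were produced.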

Section 04

Experimental Validation: Coverage Results and Defect Discovery on OpenAI Agents SDK

The evaluation covers 10 benchmarks based on the OpenAI Agents SDK:

  • 49 agents, 47 tools, 403 obligations

Coverage Results:

  • Allowed tools: 54/75 (72%)
  • Delegation obligations: 36/48 (75%)
  • Restriction violation triggers: 23/248 (9.3%)

Discovered Defects: Zombie agents, ghost tools, paper constraints, and dead-end delegations, all of which are easily overlooked in end-to-end testing.
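The reported percentages follow directly from the counts. A short check, using only the figures quoted above (the dictionary keys and the `pct` helper are illustrative names):

```python
# (hits, total) pairs as reported for the 10 benchmarks.
results = {
    "allowed_tools": (54, 75),
    "delegations": (36, 48),
    "restriction_violations": (23, 248),
}


def pct(hit, total):
    """Percentage rounded to one decimal place."""
    return round(100 * hit / total, 1)


rates = {name: pct(hit, total) for name, (hit, total) in results.items()}
for name, rate in rates.items():
    print(f"{name}: {rate}%")
```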

Section 05

Technical Depth: Complementary Value of End-to-End and Structured Testing

Limitations of End-to-End Testing: Opaque paths, incomplete coverage, weak regression detection.

Value of Structured Testing: Explicit verification that structural elements are triggered, strong regression detection, validation of design intent, alignment with documentation.

Analogy to Software Testing:

Software Testing     | Multi-Agent Testing
Line coverage        | Agent trigger coverage
Branch coverage      | Tool call path coverage
Boundary testing     | Adversarial testing of restriction rules
Integration testing  | Delegation path validation

Section 06

Practical Applications: Structured Testing Implementation Scenarios from Development to Production

Application Scenarios:

  • Development Phase: Identify unused agents/tools, verify coverage of new structures.
  • Code Review: Include coverage reports as part of PR reviews.
  • CI/CD: Integrate structural coverage into pipelines; trigger alerts if coverage drops.
  • Production Monitoring: Collect actual coverage and compare with expected values.
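For the CI/CD scenario, a coverage gate can be as simple as comparing the current run against a stored baseline. The sketch below is one possible shape, not something the study prescribes: `check_coverage`, the category names, and the `current` values are hypothetical, while the baseline reuses the allowed-tool and delegation rates reported above.

```python
def check_coverage(current, baseline, tolerance=0.0):
    """Return one message per category whose coverage fell below its baseline."""
    failures = []
    for category, base in baseline.items():
        now = current.get(category, 0.0)  # a missing category counts as 0% coverage
        if now + tolerance < base:
            failures.append(f"{category}: {now:.1%} < baseline {base:.1%}")
    return failures


# Baseline from a previous run; current values are made up for illustration.
baseline = {"allowed_tools": 0.72, "delegations": 0.75}
current = {"allowed_tools": 0.68, "delegations": 0.75}

failures = check_coverage(current, baseline)
for message in failures:
    print("coverage regression:", message)

# A CI script would typically pass this value to sys.exit() to fail the job.
exit_code = 1 if failures else 0
```

The `tolerance` parameter leaves room for run-to-run variation, which may matter when scenarios are generated by a language model rather than fixed in advance.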

Section 07

Current Limitations and Future Directions: Scenario Generation and Dynamic Structure Expansion

Limitations:

  • Scenario generation quality depends on DSPy prompts and model capabilities.
  • Some obligations are hard to trigger via natural language.
  • Only supports static workflow structures.
  • Does not verify whether triggers are correct (needs to be combined with semantic testing).

Future Directions:

  • Efficient scenario generation algorithms.
  • Reinforcement learning-based adversarial testing.
  • Dynamic workflow expansion.
  • Coverage visualization tools.

Section 08

Conclusion: Structured Testing—A New Dimension in Multi-Agent Quality Assurance

Structured coverage testing adds a new dimension to multi-agent quality assurance, answering the question: 'Is the designed structure actually used?' As multi-agent deployments become widespread, structured testing may become a standard practice, similar to the role of code coverage in software engineering.