VMRRB-Benchmark adopts a modular design, allowing researchers to flexibly configure test scenarios. Its core technical features include:
Scenario Generator: Automatically produces test cases with controlled complexity characteristics from predefined templates and randomized parameters. Each case is constructed to cover a particular combination of the four dimensions described above.
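A template-driven generator of this kind might look like the following sketch. All names here (the template strings, parameter ranges, and dimension labels) are illustrative assumptions, not part of VMRRB itself; the point is only to show how templates plus seeded randomization yield reproducible cases tagged by complexity dimension.

```python
import random
from dataclasses import dataclass

# Hypothetical dimension labels -- stand-ins for the four complexity
# dimensions the benchmark defines; the real names may differ.
DIMENSIONS = ["depth", "breadth", "ambiguity", "distraction"]

# Illustrative templates with placeholder slots filled from random parameters.
TEMPLATES = {
    "chained_arithmetic": "Start with {start}, then apply {n_steps} operations...",
    "nested_conditions": "If {cond_a} holds and {cond_b} fails, determine...",
}

@dataclass
class TestCase:
    template: str      # which template the case was instantiated from
    params: dict       # the randomized slot values
    dimensions: dict   # which complexity dimensions this case exercises

def generate_case(rng: random.Random) -> TestCase:
    name = rng.choice(list(TEMPLATES))
    params = {
        "start": rng.randint(1, 100),
        "n_steps": rng.randint(2, 6),
        "cond_a": rng.choice(["X > 0", "Y < 5"]),
        "cond_b": rng.choice(["Z = 1", "W != 2"]),
    }
    # Tag the case with a combination of dimensions so a suite can later
    # be checked for coverage of specific dimension combinations.
    dims = {d: rng.random() < 0.5 for d in DIMENSIONS}
    return TestCase(name, params, dims)

def generate_suite(n: int, seed: int = 0) -> list[TestCase]:
    # A fixed seed makes the generated suite reproducible across runs.
    rng = random.Random(seed)
    return [generate_case(rng) for _ in range(n)]
```

Seeding the generator is the key design choice: it lets different models be evaluated on byte-identical suites while still drawing cases at random.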
Evaluation Metric System: In addition to traditional accuracy metrics, VMRRB also introduces:
- Reasoning Path Completeness: Evaluates whether the model follows reasonable intermediate steps
- Error Propagation Analysis: Tracks how initial errors affect subsequent reasoning
- Recovery Ability Score: Measures the model's ability to self-correct from error states
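One plausible way to operationalize these three metrics, assuming the model's reasoning trace has been segmented into steps and each step graded as correct or not, is sketched below. The scoring formulas are my assumptions for illustration; VMRRB's actual definitions may differ.

```python
def path_completeness(model_steps: list[str], required_steps: list[str]) -> float:
    # Reasoning Path Completeness: fraction of required intermediate
    # steps that actually appear in the model's trace.
    hit = sum(1 for s in required_steps if s in model_steps)
    return hit / len(required_steps)

def error_propagation(step_correct: list[bool]) -> float:
    # Error Propagation Analysis (one simple summary): mean length of
    # consecutive-error runs. Longer runs mean an initial mistake
    # corrupted more of the downstream reasoning.
    runs, current = [], 0
    for ok in step_correct:
        if not ok:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return sum(runs) / len(runs) if runs else 0.0

def recovery_score(step_correct: list[bool]) -> float:
    # Recovery Ability Score: fraction of erroneous steps that are
    # eventually followed by a correct step, i.e. the model got back
    # on track rather than compounding the error.
    errors = [i for i, ok in enumerate(step_correct) if not ok]
    if not errors:
        return 1.0
    recovered = sum(1 for i in errors if any(step_correct[i + 1:]))
    return recovered / len(errors)
```

For example, a trace graded `[True, False, True, False, False]` recovers from one of its three errors, giving a recovery score of 1/3.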
Multi-model Comparison Framework: Supports testing multiple LLMs side by side (e.g., GPT-4, Claude, Llama) and generates detailed comparison reports to help developers select the most suitable model for a given scenario.
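The comparison framework can be reduced to a small harness: each model is wrapped as a callable from prompt to answer, every model is run over the same suite, and the results are collected into a per-model report. The interface below is a hypothetical sketch (the real framework presumably also records the path, propagation, and recovery metrics, not just accuracy).

```python
from typing import Callable

def compare_models(models: dict[str, Callable[[str], str]],
                   cases: list[tuple[str, str]]) -> dict[str, dict]:
    """Run every model over the same (prompt, expected_answer) cases
    and return a per-model report. Model callables are assumed to wrap
    the actual API clients (GPT-4, Claude, Llama, ...)."""
    report = {}
    for name, model in models.items():
        correct = sum(1 for prompt, gold in cases
                      if model(prompt).strip() == gold)
        report[name] = {
            "accuracy": correct / len(cases),
            "n_cases": len(cases),
        }
    return report

# Usage with stub models standing in for real LLM clients:
stub_models = {
    "stub_upper": lambda p: p.upper(),
    "stub_echo": lambda p: p,
}
stub_cases = [("ab", "AB"), ("cd", "cd")]
report = compare_models(stub_models, stub_cases)
```

Because every model sees the identical case list, differences in the report reflect the models rather than the sampling of test cases.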