# SystemsBench: An Open-Source Benchmark Framework for Evaluating Systems Thinking Capabilities of Large Language Models

> SystemsBench is an open-source evaluation framework designed to test the systems thinking capabilities of large language models and intelligent agents. Based on Donella Meadows' systems thinking theory, it assesses a model's systems reasoning through a five-dimensional scoring system and a nine-stage recursive engine.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-06-13T06:15:51.000Z
- Last activity: 2026-06-13T06:18:57.355Z
- Popularity: 154.9
- Keywords: SystemsBench, systems thinking, large language model evaluation, benchmarking, Donella Meadows, system dynamics, AI safety, open-source framework, recursive engine, SenseRun
- Page link: https://www.zingnex.cn/en/forum/thread/systemsbench
- Canonical: https://www.zingnex.cn/forum/thread/systemsbench

---

## SystemsBench: Introduction to the Open-Source Systems Thinking Evaluation Framework for Large Language Models

SystemsBench is an open-source evaluation framework designed to test the systems thinking capabilities of large language models and intelligent agents. It builds on Donella Meadows' systems thinking theory and assesses models with two mechanisms: a five-dimensional scoring system (understanding of stocks and flows, identification of feedback loops, perception of time delays, location of leverage points, and paradigm reflection) and a nine-stage recursive engine (the SenseRun ritual) that lets the benchmark evolve and correct itself. The project is maintained by InitiumBuilders/Outlier.Systems, and the source repository is at https://github.com/InitiumBuilders/SystemsBench.

## Why Systems Thinking Evaluation Is Critical for Large Language Models

Current benchmarks for large language models mostly measure knowledge recall and pattern matching, and rarely assess whether a model understands complex systems. Systems thinking, meaning the ability to reason about stocks and flows, feedback loops, time delays, leverage points, and deep paradigms, is what distinguishes a "smart calculator" from a "true understander". What sets SystemsBench apart is that it is a living system capable of self-evolution rather than a static test set: it applies the discipline of systems thinking to itself.
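To make these terms concrete, here is a minimal sketch of a single stock governed by a balancing feedback loop, with an optional perception delay. It is a generic system dynamics toy, not code from SystemsBench; the function name `simulate_stock` and all parameter values are assumptions chosen for illustration.

```python
def simulate_stock(target=100.0, stock=0.0, gain=0.5, delay=0, steps=6):
    """Simulate one stock filled by a goal-seeking inflow.

    Balancing feedback: each step, the inflow closes a fraction (`gain`)
    of the gap between `target` and the *perceived* stock level.
    `delay` is how many steps out of date that perception is.
    """
    history = [stock]
    for _ in range(steps):
        # With a delay, the controller sees an older stock level.
        perceived = history[max(0, len(history) - 1 - delay)]
        inflow = gain * (target - perceived)  # the flow
        stock += inflow                       # the stock accumulates the flow
        history.append(stock)
    return history


if __name__ == "__main__":
    print("no delay:   ", simulate_stock(delay=0))
    print("2-step delay:", simulate_stock(delay=2))
```

Without a delay the stock approaches the target smoothly. With a two-step delay the same rule overshoots the target, which is the kind of behavior a model has to anticipate when reasoning about time delays.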

## Core Design Philosophy and Recursive Self-Improvement Mechanism of SystemsBench

### Core Design Philosophy
- **Inheriting Meadows' Theory**: The assessment is built around Meadows' hierarchy of system intervention points (especially her leverage point theory).
- **Five-dimensional Evaluation System**: Covers five dimensions: stocks and flows, feedback loops, time delays, leverage points, and paradigm reflection.
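One way to picture a five-dimensional result is as a small record with one field per dimension. The sketch below is hypothetical: the class name `SystemsScore`, the [0, 1] scale, and the unweighted mean are assumptions for illustration, and the project's actual rubric and aggregation may differ.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SystemsScore:
    """Hypothetical per-response score; each dimension is assumed to lie in [0, 1]."""
    stocks_and_flows: float
    feedback_loops: float
    time_delays: float
    leverage_points: float
    paradigm_reflection: float

    def __post_init__(self):
        # Reject values outside the assumed scale.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0, 1], got {value}")

    def mean(self) -> float:
        """Unweighted average across the five dimensions (an assumed aggregation)."""
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)

    def weakest(self) -> str:
        """Name of the lowest-scoring dimension."""
        return min(fields(self), key=lambda f: getattr(self, f.name)).name


if __name__ == "__main__":
    score = SystemsScore(0.8, 0.6, 0.4, 0.7, 0.5)
    print(score.mean(), score.weakest())
```

Keeping the dimensions separate, rather than reporting only an aggregate, makes it visible which aspect of systems reasoning a model is weakest at.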

### Recursive Self-Improvement Engine (SenseRun Ritual)
Nine-stage process: SENSE→CRITIQUE→RESEARCH→PROPOSE→REVIEW→APPLY→CALIBRATE→LOG→RECURSE.
- **Reversibility**: Each APPLY generates a Git commit, supporting clean rollback.
- **Governance Gating**: Incremental changes are applied automatically, while structural changes require manual approval.
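The stage order and the gating rule can be sketched as follows. This is an illustrative model only, not the project's engine: the names `Stage`, `ChangeKind`, `next_stage`, and `may_apply` are hypothetical, and the wrap-around from RECURSE back to SENSE is an assumption about how the loop closes.

```python
from enum import Enum


class Stage(Enum):
    """The nine SenseRun stages, in the order given above."""
    SENSE = 1
    CRITIQUE = 2
    RESEARCH = 3
    PROPOSE = 4
    REVIEW = 5
    APPLY = 6
    CALIBRATE = 7
    LOG = 8
    RECURSE = 9


def next_stage(stage: Stage) -> Stage:
    """Advance one stage; RECURSE is assumed to loop back to SENSE."""
    return Stage(stage.value % len(Stage) + 1)


class ChangeKind(Enum):
    NON_STRUCTURAL = "non_structural"  # applied automatically
    STRUCTURAL = "structural"          # needs manual approval


def may_apply(kind: ChangeKind, human_approved: bool = False) -> bool:
    """Governance gate: structural changes pass only with manual approval."""
    return kind is ChangeKind.NON_STRUCTURAL or human_approved


if __name__ == "__main__":
    print(next_stage(Stage.REVIEW))             # Stage.APPLY
    print(may_apply(ChangeKind.STRUCTURAL))     # False
```

In a real engine the gate would sit between REVIEW and APPLY, so that a rejected structural change never produces a commit that later has to be rolled back.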

## Project Architecture and File Organization of SystemsBench

The codebase structure of SystemsBench embodies systems thinking:
- Documentation: SystemsBenchOnePage.MD (quick start), SystemsBenchStructure.MD (scoring system/question format), SystemsBenchEngine.MD (recursive engine), etc.
- Functional directories: engine/ (executable SenseRun engine), items/ (question bank), rubrics/ (scoring criteria), logs/runs/ (SenseRun logs), etc.

This structure serves as both a tool and a living textbook for systems thinking.

## Anti-Contamination Measures and Current Development Status of SystemsBench

### Anti-Contamination and Evaluation Integrity
- **Transparent Annotation**: The project is currently a v0.5.0 (Genesis+) research preview. The gold-standard set is provisional (1/30, synthetic scorer) and is labeled honestly rather than presented as certified; metrics that have not been calibrated are marked UNCALIBRATED.
- **Addressing Meta-Problems**: The project tackles the "who evaluates the evaluators" problem by recursively applying its own process to itself, with humility and openness.
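The labeling practice can be illustrated with a small sketch in which a metric is reported as uncalibrated unless explicitly marked otherwise. The `Metric` class and its report format are hypothetical and are not taken from the SystemsBench codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """Hypothetical metric record that defaults to the honest label."""
    name: str
    value: float
    calibrated: bool = False  # must be set explicitly once calibration is done

    def report(self) -> str:
        """Render the metric, flagging it if it has not been calibrated."""
        label = "" if self.calibrated else " [UNCALIBRATED]"
        return f"{self.name}: {self.value:.2f}{label}"


if __name__ == "__main__":
    print(Metric("leverage_points", 0.5).report())
    print(Metric("feedback_loops", 0.75, calibrated=True).report())
```

Defaulting `calibrated` to `False` means a forgotten flag produces an overly cautious report instead of an overly confident one.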

### Current Status
Six SenseRun runs have been logged so far. The project is iterating rapidly and is maintained by Outlier.Systems.

## Practical Significance and Community Participation Suggestions for SystemsBench

### Practical Significance
- **AI Developers**: Reveals how models "think", which helps with intelligent agent design (avoiding catastrophic failures), multi-agent coordination, and AI safety research (identifying blind spots).
- **Education**: Serves as a learning resource for systems thinking, demonstrating system dynamics principles.

### Suggestions
The project is open source, and community contributions are welcome to help the framework evolve.

## SystemsBench: The Shift from Static Testing to Dynamic Systems Thinking Evaluation

SystemsBench represents a shift in large language model evaluation: from static knowledge testing to dynamic capability observation, and from isolated metrics to holistic system understanding. Its value lies not only in what it measures but also in how it measures, and in its evaluation philosophy of continuous self-improvement. In an era of rapid AI development, this self-reflective attitude is more meaningful than any specific score.
