SystemsBench: An Open-Source Benchmark Framework for Evaluating Systems Thinking Capabilities of Large Language Models

SystemsBench is an open-source evaluation framework designed to test the systems thinking capabilities of large language models and intelligent agents. Grounded in Donella Meadows' systems thinking theory, it assesses a model's system reasoning through a five-dimensional scoring system and a nine-stage recursive engine.

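SystemsBench, systems thinking, large language models, evaluation, benchmark testing, Donella Meadows, system dynamics, AI safety, open-source framework, recursive engine, SenseRun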

Published 2026-06-13 14:15 | Recent activity 2026-06-13 14:18 | Estimated read 7 min

Section 01

SystemsBench: Introduction to the Open-Source Systems Thinking Evaluation Framework for Large Language Models

SystemsBench is an open-source evaluation framework designed to test the systems thinking capabilities of large language models and intelligent agents. It is grounded in Donella Meadows' systems thinking theory and has two main components. The first is a five-dimensional scoring system: understanding stocks and flows, identifying feedback loops, perceiving time delays, locating leverage points, and reflecting on paradigms. The second is a nine-stage recursive engine, the SenseRun ritual, which lets the benchmark evolve and correct itself. The project is maintained by InitiumBuilders/Outlier.Systems, and the source is available at https://github.com/InitiumBuilders/SystemsBench.

Section 02

Why Systems Thinking Evaluation Is Critical for Large Language Models

Most current benchmarks for large language models focus on knowledge recall and pattern matching, and do little to assess how well a model understands complex systems. Systems thinking (understanding stocks and flows, feedback loops, time delays, leverage points, and deep paradigms) is what separates a "smart calculator" from a model that genuinely understands. What sets SystemsBench apart is that it is a living system capable of self-evolution rather than a static test set: it applies the discipline of systems thinking to itself.

Section 03

Core Design Philosophy and Recursive Self-Improvement Mechanism of SystemsBench

Core Design Philosophy

Building on Meadows' Theory: The assessment is built around her hierarchy of system interventions, in particular her theory of leverage points.
Five-dimensional Evaluation System: Scores responses on stocks and flows, feedback loops, time delays, leverage points, and paradigm reflection.
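
To make the five dimensions concrete, here is a minimal sketch of how a per-response score record might be represented. The class name SystemsScore, the 0.0 to 1.0 scale, and the unweighted mean are assumptions made for illustration; the article does not describe the project's actual scale or aggregation rule.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SystemsScore:
    """Scores on the five dimensions named in the article.

    The 0.0-1.0 scale is an assumption made for this sketch.
    """
    stocks_and_flows: float
    feedback_loops: float
    time_delays: float
    leverage_points: float
    paradigm_reflection: float

    def __post_init__(self) -> None:
        # Reject values outside the assumed scale.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0, 1], got {value}")

    def overall(self) -> float:
        """Unweighted mean across dimensions (hypothetical aggregation)."""
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)

    def weakest(self) -> str:
        """Name of the lowest-scoring dimension."""
        return min(fields(self), key=lambda f: getattr(self, f.name)).name


score = SystemsScore(0.8, 0.6, 0.4, 0.5, 0.2)
print(round(score.overall(), 2))  # 0.5
print(score.weakest())            # paradigm_reflection
```

Keeping the dimensions as separate fields, rather than collapsing them into one number up front, preserves the per-dimension profile that the framework is meant to expose.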

Recursive Self-Improvement Engine (SenseRun Ritual)

Nine-stage process: SENSE→CRITIQUE→RESEARCH→PROPOSE→REVIEW→APPLY→CALIBRATE→LOG→RECURSE.

Reversibility: Each APPLY generates a Git commit, supporting clean rollback.
Governance Gating: Cumulative changes are applied automatically, while structural changes require manual approval.
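
The nine stages and the two safeguards above can be sketched as a small loop. This is an illustrative model only, not the project's engine: the Change and Engine classes, the structural flag, and the approved parameter are hypothetical, and a plain Python list stands in for the Git history.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    SENSE = 1
    CRITIQUE = 2
    RESEARCH = 3
    PROPOSE = 4
    REVIEW = 5
    APPLY = 6
    CALIBRATE = 7
    LOG = 8
    RECURSE = 9


@dataclass
class Change:
    description: str
    structural: bool  # hypothetical flag: True means the change alters structure


@dataclass
class Engine:
    history: list = field(default_factory=list)  # stands in for Git commits
    log: list = field(default_factory=list)

    def run_cycle(self, change: Change, approved: bool = False) -> bool:
        """Walk the nine stages once; return True if the change was applied."""
        applied = False
        for stage in Stage:  # Enum members iterate in definition order
            if stage is Stage.APPLY:
                # Governance gate: structural changes need manual approval.
                if change.structural and not approved:
                    self.log.append("APPLY skipped: awaiting approval")
                    continue
                self.history.append(change.description)
                applied = True
            self.log.append(stage.name)
        return applied

    def rollback(self) -> str:
        """Undo the most recent applied change (stand-in for reverting a commit)."""
        return self.history.pop()


engine = Engine()
engine.run_cycle(Change("reword rubric note", structural=False))  # applied
engine.run_cycle(Change("add sixth dimension", structural=True))  # gated
print(engine.history)  # ['reword rubric note']
```

Recording one history entry per APPLY is what makes the rollback clean in this model: undoing a change never has to untangle it from another.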

Section 04

Project Architecture and File Organization of SystemsBench

The codebase structure of SystemsBench embodies systems thinking:

Documentation: SystemsBenchOnePage.MD (quick start), SystemsBenchStructure.MD (scoring system/question format), SystemsBenchEngine.MD (recursive engine), etc.
Functional directories: engine/ (executable SenseRun engine), items/ (question bank), rubrics/ (scoring criteria), logs/runs/ (SenseRun logs), etc.

This structure serves as both a tool and a living textbook for systems thinking.
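
For readers who clone the repository, the layout described above can be checked with a short script. This is a sketch under assumptions: it lists only the paths named in this article (the real repository contains more, as the "etc." indicates), it assumes the documents sit at the repository root, and the function name missing_paths is hypothetical.

```python
from pathlib import Path

# Only the paths named in the article; the real repository may contain more.
EXPECTED_DIRS = ["engine", "items", "rubrics", "logs/runs"]
# Assumption: the documents live at the repository root.
EXPECTED_DOCS = [
    "SystemsBenchOnePage.MD",
    "SystemsBenchStructure.MD",
    "SystemsBenchEngine.MD",
]


def missing_paths(repo_root: str) -> list[str]:
    """Return the expected directories and documents absent under repo_root."""
    root = Path(repo_root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_DOCS if not (root / f).is_file()]
    return missing
```

Calling missing_paths on a local checkout returns an empty list when every path named here is present.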

Section 05

Anti-Contamination Measures and Current Development Status of SystemsBench

Anti-Contamination and Evaluation Integrity

Transparent Annotation: The project is currently a v0.5.0 (Genesis+) research preview. Its gold standard set is provisional (1/30, synthetic scorer) and is labeled honestly as such rather than presented as certified; metrics that have not been calibrated are marked UNCALIBRATED.
Addressing the Meta-Problem: The framework tackles the "who evaluates the evaluators" problem by applying itself recursively, while staying humble and open about its limits.
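
The UNCALIBRATED labeling described above could work roughly as follows. This sketch assumes that "1/30" means 1 of a planned 30 gold items has been confirmed, which the article does not spell out. The names GoldSetStatus and report_metric are hypothetical, and the metric name and value are invented for illustration, not project results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldSetStatus:
    validated_items: int  # items with a confirmed gold answer (assumed meaning)
    target_items: int     # planned size of the gold set (assumed meaning)

    @property
    def calibrated(self) -> bool:
        return self.validated_items >= self.target_items


def report_metric(name: str, value: float, gold: GoldSetStatus) -> str:
    """Format a metric, tagging it UNCALIBRATED until the gold set is complete."""
    line = f"{name}: {value:.2f}"
    if not gold.calibrated:
        line += f" [UNCALIBRATED, gold set {gold.validated_items}/{gold.target_items}]"
    return line


# Illustrative values only.
print(report_metric("scorer_agreement", 0.71, GoldSetStatus(1, 30)))
# scorer_agreement: 0.71 [UNCALIBRATED, gold set 1/30]
```

Attaching the caveat to the metric itself, rather than to a separate note, keeps the number from being quoted without its calibration status.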

Current Status

Six SenseRun runs have been logged so far. The project is in a rapid iteration phase and is maintained by Outlier.Systems.

Section 06

Practical Significance and Community Participation Suggestions for SystemsBench

Practical Significance

AI Developers: Reveals how models "think", which supports intelligent agent design (avoiding catastrophic failures), multi-agent coordination, and AI safety research (identifying blind spots).
Education: Serves as a learning resource for systems thinking by demonstrating system dynamics principles.

Suggestions

The project adopts an open-source model; community contributions are welcome to jointly promote the evolution of the framework.

Section 07

SystemsBench: The Shift from Static Testing to Dynamic Systems Thinking Evaluation

SystemsBench represents an important shift in large language model evaluation: from static knowledge testing to dynamic capability observation, and from isolated metrics to holistic system understanding. Its value lies not only in what it measures but also in how it measures and in its philosophy of continuous self-improvement. In an era of rapid AI development, this habit of self-reflection matters more than any specific score.