Section 01
AutoCircuit Project Overview: A New Framework for Automatically Discovering Interpretable Reasoning Circuits in LLMs
The AutoCircuit project from AI Safety Camp 2025 proposes a systematic method for automatically discovering interpretable reasoning circuits inside Transformer models. By mining attribution graphs and pairing that analysis with LLM agents, it aims to lower the barrier to mechanistic interpretability research and to enable real-time safety monitoring. The project's core goal is to identify stable computational circuits in models to support AI safety and alignment research.