Zing Forum


Cambridge MPhil Thesis Open-Sourced: Reproducing Anthropic's Interpretability Research on Qwen3-4B

A master's thesis project from DAMTP at the University of Cambridge is the first to reproduce Anthropic's mechanistic interpretability methods on the open-source model Qwen3-4B, including transcoder feature extraction, attribution graph construction, and causal intervention validation, and it provides a complete open-source implementation for multilingual circuit analysis.

Tags: Mechanistic Interpretability · Qwen3-4B · Sparse Autoencoders · Transcoders · Attribution Graphs · Multilingual Models · Causal Intervention · University of Cambridge · Open-Source AI · Neural Network Interpretability
Published 2026-04-03 19:07 · Recent activity 2026-04-03 19:18 · Estimated read: 7 min

Section 01

Introduction: Cambridge MPhil Thesis Open-Sourced — Reproducing Anthropic's Mechanistic Interpretability Research on Qwen3-4B

Iuliia Vitiugova of DAMTP (the Department of Applied Mathematics and Theoretical Physics) at the University of Cambridge recently open-sourced her MPhil thesis project, which successfully reproduces the core methods of Anthropic's study 'On the Biology of a Large Language Model' (transcoder feature extraction, attribution graph construction, and causal intervention validation) on the open-source large language model Qwen3-4B. The work fills a key gap in open-source mechanistic interpretability and provides a complete, reproducible technical framework for multilingual circuit analysis.


Section 02

Research Background: A Breakthrough in Mechanistic Interpretability from Closed-Source to Open-Source

Mechanistic interpretability aims to open the black box of neural networks and understand their internal computational mechanisms. In early 2025, Anthropic released groundbreaking research on Claude 3.5 Haiku, demonstrating how to extract interpretable features with transcoders (a sparse-autoencoder variant that reconstructs an MLP layer's output from its input) and how to construct attribution graphs that trace causal interactions between features. Because the model is closed-source, however, the academic community has found these methods difficult to reproduce and extend. This project is the first to port them to the fully open-source Qwen3-4B, demonstrating that the techniques generalize and paving the way for follow-up research.

3

Section 03

Core Technical Methods: Three-Layer Progressive Analysis of Model Internal Mechanisms

1. Transcoder Feature Extraction: deploy sparse transcoders at the model's MLP layers (layers 10-25), mapping high-dimensional activations into a 163,840-dimensional sparse feature space whose directions correspond to human-understandable concepts.
2. Attribution Graph Construction: build a graph of 94 feature nodes and 851 edges, distinguishing star edges (direct links from features to outputs) from VW edges (information flow between features).
3. Causal Intervention Validation: verify that edges are causal, not merely statistically correlated, via three methods: ablation, activation patching, and feature steering.
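The first step above can be sketched in a few lines. The snippet below is a minimal toy transcoder, not the thesis's actual code: it replaces an MLP layer with a sparse bottleneck, encoding the layer's input into a wide feature space (163,840 dimensions in the thesis; tiny toy dimensions here) and decoding back to the layer's output space. All class and variable names are illustrative.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class Transcoder:
    """Toy transcoder: input -> sparse features -> reconstructed MLP output."""

    def __init__(self, d_model, n_features, seed=0):
        rng = np.random.default_rng(seed)
        # Encoder maps residual-stream activations to sparse features;
        # decoder reconstructs the MLP layer's output from those features.
        self.W_enc = rng.normal(0, 0.02, (d_model, n_features))
        self.b_enc = np.zeros(n_features)
        self.W_dec = rng.normal(0, 0.02, (n_features, d_model))
        self.b_dec = np.zeros(d_model)

    def encode(self, x):
        # ReLU zeroes most entries, giving a sparse, interpretable code.
        return relu(x @ self.W_enc + self.b_enc)

    def forward(self, x):
        # Approximate the original MLP's output from the sparse code.
        return self.encode(x) @ self.W_dec + self.b_dec

# Toy dimensions stand in for Qwen3-4B's hidden size and 163,840 features.
tc = Transcoder(d_model=64, n_features=512)
x = np.random.default_rng(1).normal(size=(1, 64))
feats = tc.encode(x)
print(feats.shape)
```

In the real pipeline such a transcoder is trained to minimize reconstruction error with a sparsity penalty, so that individual feature directions line up with recognizable concepts.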

Section 04

Multilingual Circuit Analysis: Discovery of Cross-Lingual Shared Abstract Representations

Focusing on the multilingual antonym prediction task (multilingual_circuits_b1), the analysis finds that the model uses shared 'bridge features' to handle cross-lingual concepts: 60.4% of key features (32 of 53) activate under both English and French inputs. The late layers (L22-L25) form two communities, one French-specific (84% French-biased) and one bilingually balanced (89%), reflecting a division-of-labor strategy in multilingual processing.
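The bridge-feature statistic above boils down to a set intersection over the key features that fire for each language. The sketch below shows that computation with made-up feature IDs; the thesis's actual denominator is its set of 53 key features, which we approximate here with the union of the two per-language sets.

```python
def bridge_fraction(english_active, french_active):
    """Fraction of key features active in BOTH languages, plus the shared IDs.

    Assumption: the key-feature set is the union of the two per-language
    active sets (a stand-in for the thesis's 53 hand-identified features).
    """
    en, fr = set(english_active), set(french_active)
    shared = en & fr
    return len(shared) / len(en | fr), sorted(shared)

# Toy example: 5 distinct features overall, 3 of them shared -> 60%.
frac, shared = bridge_fraction({101, 102, 103, 104}, {102, 103, 104, 205})
print(f"{frac:.1%} shared: {shared}")  # → 60.0% shared: [102, 103, 104]
```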


Section 05

Causal Validation: Evidence of Cross-Lingual Mechanism Transfer Under Strict Standards

Three criteria for causal validation are proposed:
1. Directionality: intervening on features changes outputs.
2. Persistence: the same features activate when inputs change but the mechanism stays the same.
3. Replaceability: replacing features produces predictable output changes.
Cross-lingual injection tests (S2) show that 75% of English-French concept pairs (6 of 8) exhibit a strong transfer effect, with an average effect size of 0.371 (7x that of degraded circuits), providing causal evidence for shared cross-lingual mechanisms.
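The three intervention styles named in Section 03 (ablation, activation patching, feature steering) underpin these tests. The sketch below shows each one applied to a vector of sparse feature activations; the thesis's effect sizes come from comparing model outputs before and after such edits, which this toy omits. All names and values are illustrative.

```python
import numpy as np

def ablate(acts, idx):
    """Ablation: zero out one feature and observe how the output changes."""
    out = acts.copy()
    out[idx] = 0.0
    return out

def patch(acts, donor_acts, idx):
    """Activation patching: copy a feature's value from a donor run
    (e.g. the French prompt) into the current run (e.g. the English one).
    This is the shape of a cross-lingual injection test."""
    out = acts.copy()
    out[idx] = donor_acts[idx]
    return out

def steer(acts, idx, scale=2.0):
    """Feature steering: amplify a feature to push the output toward
    the concept it represents."""
    out = acts.copy()
    out[idx] *= scale
    return out

acts = np.array([0.0, 1.5, 0.3])   # activations on the English prompt
donor = np.array([0.8, 0.2, 0.9])  # activations on the French prompt
print(ablate(acts, 1), patch(acts, donor, 2), steer(acts, 1))
```

An edge counts as causal when these edits move the model's output in the direction the attribution graph predicts, rather than leaving it unchanged.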


Section 06

Computational Type Theory Framework: Four Types of Large Model Computational Patterns

A four-type taxonomy of large-model computational patterns is proposed:
1. Hidden State Transfer: hierarchical transfer and transformation of information (e.g., geographic location reasoning).
2. Candidate Set Filtering: generating candidates and then filtering the outputs (e.g., grammatical agreement, antonym selection).
3. Abstract Mapping: surface input → abstract representation → surface output (multilingual circuits fall into this category).
4. Gated Decision: classifiers control whether information passes, is blocked, or is redirected (e.g., safety filtering).
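The taxonomy above can be captured as a small lookup structure, which is handy when labeling circuits during analysis. This is purely illustrative; the thesis does not prescribe this representation, and the example tasks are the ones listed in the article.

```python
from enum import Enum

class CircuitType(Enum):
    """The four computational patterns, with short descriptions."""
    HIDDEN_STATE_TRANSFER = "hierarchical transfer and transformation of information"
    CANDIDATE_SET_FILTERING = "generate candidates, then filter the outputs"
    ABSTRACT_MAPPING = "surface input -> abstract representation -> surface output"
    GATED_DECISION = "classifier gates information: pass / block / redirect"

# Example tasks from the article, keyed by pattern.
EXAMPLES = {
    CircuitType.HIDDEN_STATE_TRANSFER: "geographic location reasoning",
    CircuitType.CANDIDATE_SET_FILTERING: "antonym selection",
    CircuitType.ABSTRACT_MAPPING: "multilingual circuits",
    CircuitType.GATED_DECISION: "safety filtering",
}

print(EXAMPLES[CircuitType.ABSTRACT_MAPPING])  # → multilingual circuits
```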


Section 07

Practical Significance, Limitations, and Future Directions

Significance: the work shows that a small open-source model (4B parameters) supports interpretability analysis at the level previously demonstrated only on closed-source models; it provides a complete code implementation (a full set of scripts for prompt generation, baseline evaluation, feature extraction, and more); and it reveals cross-lingual shared abstract representations that can guide safety alignment and capability editing for multilingual models. Limitations: architectural constraints of Qwen3-4B lead to weak linear retention rates in some layers (a consequence of its distributed 16-layer feature pipeline). Future Directions: extend to larger open-source models (Qwen3-30B and the Llama series), explore gated mechanisms (Type 4), and develop automated circuit-discovery tools.