Zing Forum


Cambridge MPhil Thesis Open-Sourced: Reproducing Anthropic's Interpretability Research on Qwen3-4B

A master's thesis project from DAMTP at the University of Cambridge is the first to reproduce Anthropic's mechanistic interpretability methods on the open-source model Qwen3-4B, including transcoder feature extraction, attribution graph construction, and causal intervention validation, and it provides a complete open-source implementation for multilingual circuit analysis.

Tags: Mechanistic Interpretability · Qwen3-4B · Sparse Autoencoders · Transcoders · Attribution Graphs · Multilingual Models · Causal Intervention · University of Cambridge · Open-Source AI · Neural Network Interpretability
Published 2026-04-03 19:07 · Recent activity 2026-04-03 19:18 · Estimated read: 7 min

Section 01

Introduction: Cambridge MPhil Thesis Open-Sourced — Reproducing Anthropic's Mechanistic Interpretability Research on Qwen3-4B

Iuliia Vitiugova of DAMTP (the Department of Applied Mathematics and Theoretical Physics) at the University of Cambridge recently open-sourced her MPhil thesis project, which successfully reproduces the core methods of Anthropic's study 'On the Biology of a Large Language Model' (transcoder feature extraction, attribution graph construction, and causal intervention validation) on the open-source large language model Qwen3-4B. The work fills a key gap in open-source mechanistic interpretability and provides a complete, reproducible technical framework for multilingual circuit analysis.


Section 02

Research Background: A Breakthrough in Mechanistic Interpretability from Closed-Source to Open-Source

Mechanistic interpretability aims to open the black box of neural networks and understand their internal computational mechanisms. In early 2025, Anthropic released groundbreaking research on Claude 3.5 Haiku, demonstrating how to extract interpretable features with transcoders (a sparse-autoencoder variant that reconstructs an MLP layer's output from its input) and how to construct attribution graphs that trace causal interactions between features. Because the model is closed-source, however, the academic community has found these methods difficult to reproduce and extend. This project is the first to port them to the fully open-source Qwen3-4B, demonstrating that the techniques generalize and paving the way for follow-up research.

3

Section 03

Core Technical Methods: Three-Layer Progressive Analysis of Model Internal Mechanisms

1. Transcoder Feature Extraction: deploy sparse transcoders at the model's MLP layers (layers 10-25), mapping high-dimensional activations into a 163,840-dimensional sparse feature space whose directions correspond to human-understandable concepts.
2. Attribution Graph Construction: build a graph of 94 feature nodes and 851 edges, distinguishing star edges (direct links from features to outputs) from VW edges (information flow between features).
3. Causal Intervention Validation: verify that edges are causal, not merely statistically correlated, via three methods: ablation, activation patching, and feature steering.
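The first step above can be sketched in a few lines. The snippet below is a minimal toy transcoder, not the thesis's actual code: it replaces an MLP layer with a sparse bottleneck, encoding the layer's input into a wide feature space (163,840 dimensions in the thesis; tiny toy dimensions here) and decoding back to the layer's output space. All class and variable names are illustrative.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class Transcoder:
    """Toy transcoder: input -> sparse features -> reconstructed MLP output."""

    def __init__(self, d_model, n_features, seed=0):
        rng = np.random.default_rng(seed)
        # Encoder maps residual-stream activations to sparse features;
        # decoder reconstructs the MLP layer's output from those features.
        self.W_enc = rng.normal(0, 0.02, (d_model, n_features))
        self.b_enc = np.zeros(n_features)
        self.W_dec = rng.normal(0, 0.02, (n_features, d_model))
        self.b_dec = np.zeros(d_model)

    def encode(self, x):
        # ReLU zeroes most entries, giving a sparse, interpretable code.
        return relu(x @ self.W_enc + self.b_enc)

    def forward(self, x):
        # Approximate the original MLP's output from the sparse code.
        return self.encode(x) @ self.W_dec + self.b_dec

# Toy dimensions stand in for Qwen3-4B's hidden size and 163,840 features.
tc = Transcoder(d_model=64, n_features=512)
x = np.random.default_rng(1).normal(size=(1, 64))
feats = tc.encode(x)
print(feats.shape)
```

In the real pipeline such a transcoder is trained to minimize reconstruction error with a sparsity penalty, so that individual feature directions line up with recognizable concepts.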

Section 04

Multilingual Circuit Analysis: Discovery of Cross-Lingual Shared Abstract Representations

Focusing on the multilingual antonym prediction task (multilingual_circuits_b1), the analysis finds that the model uses shared 'bridge features' to handle cross-lingual concepts: 60.4% of key features (32 of 53) activate under both English and French inputs. The late layers (L22-L25) form two communities, one French-specific (84% French-biased) and one bilingually balanced (89%), reflecting a division-of-labor strategy in multilingual processing.
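The bridge-feature statistic above boils down to a set intersection over the key features that fire for each language. The sketch below shows that computation with made-up feature IDs; the thesis's actual denominator is its set of 53 key features, which we approximate here with the union of the two per-language sets.

```python
def bridge_fraction(english_active, french_active):
    """Fraction of key features active in BOTH languages, plus the shared IDs.

    Assumption: the key-feature set is the union of the two per-language
    active sets (a stand-in for the thesis's 53 hand-identified features).
    """
    en, fr = set(english_active), set(french_active)
    shared = en & fr
    return len(shared) / len(en | fr), sorted(shared)

# Toy example: 5 distinct features overall, 3 of them shared -> 60%.
frac, shared = bridge_fraction({101, 102, 103, 104}, {102, 103, 104, 205})
print(f"{frac:.1%} shared: {shared}")  # → 60.0% shared: [102, 103, 104]
```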


Section 05

Causal Validation: Evidence of Cross-Lingual Mechanism Transfer Under Strict Standards

Three criteria for causal validation are proposed:
1. Directionality: intervening on features changes outputs.
2. Persistence: the same features activate when inputs change but the mechanism stays the same.
3. Replaceability: replacing features produces predictable output changes.
Cross-lingual injection tests (S2) show that 75% of English-French concept pairs (6 of 8) exhibit a strong transfer effect, with an average effect size of 0.371 (7x that of degraded circuits), providing causal evidence for shared cross-lingual mechanisms.
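The three intervention styles named in Section 03 (ablation, activation patching, feature steering) underpin these tests. The sketch below shows each one applied to a vector of sparse feature activations; the thesis's effect sizes come from comparing model outputs before and after such edits, which this toy omits. All names and values are illustrative.

```python
import numpy as np

def ablate(acts, idx):
    """Ablation: zero out one feature and observe how the output changes."""
    out = acts.copy()
    out[idx] = 0.0
    return out

def patch(acts, donor_acts, idx):
    """Activation patching: copy a feature's value from a donor run
    (e.g. the French prompt) into the current run (e.g. the English one).
    This is the shape of a cross-lingual injection test."""
    out = acts.copy()
    out[idx] = donor_acts[idx]
    return out

def steer(acts, idx, scale=2.0):
    """Feature steering: amplify a feature to push the output toward
    the concept it represents."""
    out = acts.copy()
    out[idx] *= scale
    return out

acts = np.array([0.0, 1.5, 0.3])   # activations on the English prompt
donor = np.array([0.8, 0.2, 0.9])  # activations on the French prompt
print(ablate(acts, 1), patch(acts, donor, 2), steer(acts, 1))
```

An edge counts as causal when these edits move the model's output in the direction the attribution graph predicts, rather than leaving it unchanged.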


Section 06

Computational Type Theory Framework: Four Types of Large Model Computational Patterns

A four-type taxonomy of large-model computational patterns is proposed:
1. Hidden State Transfer: hierarchical transfer and transformation of information (e.g., geographic location reasoning).
2. Candidate Set Filtering: generating candidates and then filtering the outputs (e.g., grammatical agreement, antonym selection).
3. Abstract Mapping: surface input → abstract representation → surface output (multilingual circuits fall into this category).
4. Gated Decision: classifiers control whether information passes, is blocked, or is redirected (e.g., safety filtering).
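The taxonomy above can be captured as a small lookup structure, which is handy when labeling circuits during analysis. This is purely illustrative; the thesis does not prescribe this representation, and the example tasks are the ones listed in the article.

```python
from enum import Enum

class CircuitType(Enum):
    """The four computational patterns, with short descriptions."""
    HIDDEN_STATE_TRANSFER = "hierarchical transfer and transformation of information"
    CANDIDATE_SET_FILTERING = "generate candidates, then filter the outputs"
    ABSTRACT_MAPPING = "surface input -> abstract representation -> surface output"
    GATED_DECISION = "classifier gates information: pass / block / redirect"

# Example tasks from the article, keyed by pattern.
EXAMPLES = {
    CircuitType.HIDDEN_STATE_TRANSFER: "geographic location reasoning",
    CircuitType.CANDIDATE_SET_FILTERING: "antonym selection",
    CircuitType.ABSTRACT_MAPPING: "multilingual circuits",
    CircuitType.GATED_DECISION: "safety filtering",
}

print(EXAMPLES[CircuitType.ABSTRACT_MAPPING])  # → multilingual circuits
```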


Section 07

Practical Significance, Limitations, and Future Directions

Significance: the work shows that a small open-source model (4B parameters) supports interpretability analysis at the level previously demonstrated only on closed-source models; it provides a complete code implementation (a full set of scripts for prompt generation, baseline evaluation, feature extraction, and more); and it reveals cross-lingual shared abstract representations that can guide safety alignment and capability editing for multilingual models. Limitations: architectural constraints of Qwen3-4B lead to weak linear retention rates in some layers (a consequence of its distributed 16-layer feature pipeline). Future Directions: extend to larger open-source models (Qwen3-30B and the Llama series), explore gated mechanisms (Type 4), and develop automated circuit-discovery tools.