Zing Forum

Reading

Quantifying Cross-Large Language Model Feature Space Universality Using Sparse Autoencoders

A cutting-edge study on the geometric similarity of feature spaces from sparse autoencoders of different large language models, which pairs features via activation correlation and measures the relational similarity of the geometry of decoder weights.

sparse autoencoder · feature space · large language models · mechanistic interpretability · SVCCA · RSA · cross-model alignment · neural network interpretability
Published 2026-05-14 07:54 · Recent activity 2026-05-14 07:59 · Estimated read 8 min

Section 01

Introduction: Quantifying LLM Feature Space Universality Using Sparse Autoencoders

This study focuses on the geometric similarity of feature spaces across different large language models (LLMs). It decomposes the internal activation patterns of models into interpretable feature sets using sparse autoencoders (SAEs), pairs cross-model features via activation correlation, and quantifies feature space universality using methods like SVCCA and RSA. The research aims to reveal whether models of different architectures/scales share internal representation rules, providing new tools and perspectives for mechanistic interpretability, model alignment safety, and knowledge transfer.


Section 02

Research Background and Motivation

With the rapid development of LLMs, a core question arises: do models of different architectures and scales learn similar internal representations? Traditionally, each model was assumed to discover language regularities independently, but growing evidence suggests they may share a 'universal language'. Sparse autoencoders (SAEs) can decompose activation patterns into semantic features, but features from different models may not align directly. The key question this study poses: even if individual features cannot be matched one-to-one, does the geometric structure of the feature spaces still exhibit similarity?


Section 03

Core Methodology

Activation Correlation Method for Feature Pairing

SAE features from different models are paired via activation correlation: features with similar activation patterns on the same input text are treated as potential correspondences, a criterion more flexible than matching by feature labels.
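The pairing step described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: it assumes each model's SAE activations on a shared token batch are collected into an (n_features, n_tokens) matrix, and greedily matches each feature of model A to its highest-Pearson-correlation feature in model B.

```python
import numpy as np

def pair_features_by_activation(acts_a, acts_b):
    """Pair rows (features) of two activation matrices by Pearson correlation.

    acts_a: (n_features_a, n_tokens) SAE activations of model A on a token batch.
    acts_b: (n_features_b, n_tokens) activations of model B on the SAME batch.
    Returns, for each feature of A, the index of its best match in B and the
    correlation of that match.
    """
    # Center each feature's activations, then normalize so that a dot
    # product between rows equals the Pearson correlation.
    a = acts_a - acts_a.mean(axis=1, keepdims=True)
    b = acts_b - acts_b.mean(axis=1, keepdims=True)
    a /= np.linalg.norm(a, axis=1, keepdims=True) + 1e-8
    b /= np.linalg.norm(b, axis=1, keepdims=True) + 1e-8
    corr = a @ b.T  # (n_features_a, n_features_b) correlation matrix
    return corr.argmax(axis=1), corr.max(axis=1)
```

As a sanity check, if model B's features are just a permutation of model A's, this recovers the inverse permutation with correlation 1 for every feature.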

Relational Similarity Metrics

  • SVCCA: Compares the canonical correlation of feature decoder weights to quantify spatial alignment;
  • RSA: Calculates the correlation of distance matrices between features to capture the similarity of overall geometric structures;
  • Baseline methods: Direct cosine similarity and other comparison benchmarks.
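The two relational metrics above can be sketched with NumPy alone. These are simplified stand-ins for the study's actual pipeline, assuming the paired features' decoder weights of each model are stacked into an (n_paired_features, d_model) matrix: SVCCA is approximated by the principal angles between the top singular subspaces, and RSA by a Spearman correlation (ties ignored for simplicity) of the pairwise cosine-distance matrices.

```python
import numpy as np

def svcca(W_a, W_b, k=10):
    """SVCCA-style score between two decoder-weight matrices (n_features, d_model).

    Keeps the top-k left singular directions of each centered matrix; the
    canonical correlations then reduce to the singular values of the product
    of the two orthonormal bases (cosines of principal angles)."""
    def top_dirs(W, k):
        U, _, _ = np.linalg.svd(W - W.mean(axis=0), full_matrices=False)
        return U[:, :k]
    s = np.linalg.svd(top_dirs(W_a, k).T @ top_dirs(W_b, k), compute_uv=False)
    return float(s.mean())  # mean canonical correlation in [0, 1]

def _dist_vec(W):
    """Upper triangle of the pairwise cosine-distance matrix between features."""
    Wn = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-8)
    D = 1.0 - Wn @ Wn.T
    iu = np.triu_indices(len(W), k=1)
    return D[iu]

def _ranks(x):
    r = np.empty(len(x))
    r[np.argsort(x)] = np.arange(len(x))
    return r

def rsa(W_a, W_b):
    """Spearman correlation of the two distance structures (no tie handling)."""
    r_a = _ranks(_dist_vec(W_a)); r_a -= r_a.mean()
    r_b = _ranks(_dist_vec(W_b)); r_b -= r_b.mean()
    return float(r_a @ r_b / (np.linalg.norm(r_a) * np.linalg.norm(r_b)))
```

Note that both scores are invariant to an orthogonal rotation of one model's feature space, which is exactly why they can detect shared geometry even when individual feature directions differ.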

Semantic Subspace Analysis

Explores how similarity varies across semantic subspaces (e.g., how cross-model alignment differs between a mathematical-reasoning subspace and a sentiment-analysis subspace).
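One simple way to probe this, sketched below under assumed inputs (the subspace labels and index arrays are hypothetical, e.g. derived from probing or feature interpretation), is to group the per-feature pairing correlations by semantic subspace and compare the averages:

```python
import numpy as np

def subspace_alignment(pair_corrs, subspaces):
    """Mean cross-model pairing correlation per semantic subspace.

    pair_corrs: (n_features,) best-match activation correlations per feature.
    subspaces:  dict mapping a subspace name to an index array of the
                features assigned to it (hypothetical labels).
    """
    return {name: float(pair_corrs[idx].mean())
            for name, idx in subspaces.items()}

scores = subspace_alignment(
    np.array([0.9, 0.8, 0.2, 0.1]),
    {"math": np.array([0, 1]), "sentiment": np.array([2, 3])},
)
# A higher mean for one subspace would suggest its features align
# better across the two models.
```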


Section 04

Technical Implementation Highlights

Interactive Feature Space Visualization

Provides the pythia_feature_mapping_viz.py script to generate self-contained HTML pages with two UMAP panels (corresponding to the SAE feature spaces of two models). The same batch of text is input into both models, features are mapped via batch activation correlation, decoder directions are dimensionality-reduced using UMAP, and users can hover/select to highlight corresponding features across panels.

Experimental Framework and Reproducibility

Supports cross-layer/cross-scale comparisons of Pythia series models, with configuration of batch size, sequence length, number of random runs, and model layer range via command-line parameters.
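An invocation might look like the following. This is an illustrative sketch only: the flag names are assumptions, not confirmed against the repository's actual command-line interface.

```shell
# Hypothetical flags -- check the script's --help for the real interface.
python pythia_feature_mapping_viz.py \
  --model-a pythia-70m --model-b pythia-160m \
  --layer-range 3 5 \
  --batch-size 32 --max-seq-len 128 \
  --n-runs 5
```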


Section 05

Implications of Research Findings

Contribution to Model Interpretability

Provides new tools for mechanistic interpretability, revealing that the internal concept organization of models may follow cross-model universal rules.

Implications for Model Alignment and Safety

If geometric similarity exists in feature spaces, universal monitoring and intervention methods can be developed (e.g., safety feature patterns from one model can be transferred to others).

Impact on Model Compression and Transfer

Feature space universality provides a theoretical basis for knowledge transfer, helping to design efficient transfer learning strategies and reduce training resources for new models.


Section 06

Technical Details and Usage Guide

The project codebase has a clear structure (main scripts, analysis notebooks, cloud auxiliary scripts, documentation), supports installation via conda/pip, and provides Windows configuration scripts. Reproduction example: Comparing the feature spaces of Pythia 70M and 160M models can be done via shell scripts, supporting custom batch size, maximum sequence length, and analysis layer range.


Section 07

Limitations and Future Directions

The current study mainly focuses on Pythia series models; future work needs to expand to more architectures (e.g., Transformer variants, state space models). Activation correlation pairing may miss feature correspondences that are semantically related but have different activation patterns. The codebase is being refactored, and more complete documentation and user experience will be available in the future.


Section 08

Conclusion

This study is an important step toward understanding the internal world of LLMs. By quantifying feature space universality, it not only provides technical tools but also offers a new perspective: seemingly independent large models may collectively approach the deep truths of language and intelligence.