# Quantifying Cross-Large Language Model Feature Space Universality Using Sparse Autoencoders

> A cutting-edge study on the geometric similarity of feature spaces from sparse autoencoders of different large language models, which pairs features via activation correlation and measures the relational similarity of the geometry of decoder weights.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-05-13T23:54:28.000Z
- Last activity: 2026-05-13T23:59:24.971Z
- Popularity: 159.9
- Keywords: sparse autoencoder, feature space, large language models, mechanistic interpretability, SVCCA, RSA, cross-model alignment, neural network interpretability
- Page URL: https://www.zingnex.cn/en/forum/thread/geo-github-wlg1-univ-feat-geom
- Canonical: https://www.zingnex.cn/forum/thread/geo-github-wlg1-univ-feat-geom
- Markdown source: floors_fallback

---

## Introduction: Quantifying LLM Feature Space Universality Using Sparse Autoencoders

This study focuses on the geometric similarity of feature spaces across different large language models (LLMs). It decomposes the internal activation patterns of models into interpretable feature sets using sparse autoencoders (SAEs), pairs cross-model features via activation correlation, and quantifies feature space universality using methods like SVCCA and RSA. The research aims to reveal whether models of different architectures/scales share internal representation rules, providing new tools and perspectives for mechanistic interpretability, model alignment safety, and knowledge transfer.

## Research Background and Motivation

With the rapid development of LLMs, a core question arises: do models of different architectures and scales learn similar internal representations? Traditionally, each model was assumed to discover language rules independently, but growing evidence suggests they may share a 'universal language'. Sparse autoencoders (SAEs) can decompose activation patterns into semantic features, yet features from different models may not align directly. The central question of this study: even when individual features cannot be matched one-to-one, does the geometric structure of the feature spaces still exhibit similarity?

## Core Methodology

### Activation Correlation Method for Feature Pairing
SAE features from different models are paired via activation correlation: features whose activation patterns on the same input text are highly correlated are treated as potential correspondences. This is more flexible than matching on feature labels.
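A minimal NumPy sketch of this pairing step, assuming we already have the two models' SAE activations on a shared token stream (the array shapes and the greedy argmax matching are illustrative assumptions, not the repository's exact implementation):

```python
import numpy as np

def pair_features_by_activation_correlation(acts_a, acts_b):
    """Pair SAE features across two models by correlating their
    activations on the same token stream.

    acts_a: (n_tokens, n_feat_a) SAE feature activations from model A
    acts_b: (n_tokens, n_feat_b) SAE feature activations from model B
    Returns, for each feature of A, the index of the most correlated
    feature of B and that correlation value.
    """
    # Standardize columns so a single matrix product yields Pearson
    # correlations between every A-feature/B-feature pair.
    za = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-8)
    zb = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-8)
    corr = za.T @ zb / len(acts_a)          # (n_feat_a, n_feat_b)
    best = corr.argmax(axis=1)              # greedy best match per A-feature
    return best, corr[np.arange(len(best)), best]

# Toy check: A's feature 0 tracks B's feature 1 on the same 100 "tokens".
rng = np.random.default_rng(0)
acts_b = rng.random((100, 3))
acts_a = np.stack([acts_b[:, 1], rng.random(100)], axis=1)
best, vals = pair_features_by_activation_correlation(acts_a, acts_b)
```

Standardizing first keeps the whole correlation table to one matrix multiply, which matters when each model has tens of thousands of SAE features.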

### Relational Similarity Metrics
- **SVCCA** (Singular Vector Canonical Correlation Analysis): reduces the paired decoder weight matrices with SVD and compares their canonical correlations, quantifying how well the two spaces align;
- **RSA** (Representational Similarity Analysis): correlates the pairwise-distance matrices of the two feature sets, capturing similarity of the overall geometric structure;
- Baselines: direct cosine similarity and other reference comparisons.
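The two metrics can be sketched in plain NumPy; these are simplified textbook versions operating on paired decoder matrices, not the repository's exact code:

```python
import numpy as np

def svcca(X, Y, k=10):
    """Mean canonical correlation between the top-k SVD subspaces of two
    (n_features, dim) matrices of paired decoder directions."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    Ux = np.linalg.svd(Xc, full_matrices=False)[0][:, :k]
    Uy = np.linalg.svd(Yc, full_matrices=False)[0][:, :k]
    # Canonical correlations of two orthonormal bases are the singular
    # values of their cross-product.
    return np.linalg.svd(Ux.T @ Uy, compute_uv=False).mean()

def rsa(X, Y):
    """Pearson correlation between the pairwise-distance matrices of X
    and Y (representational similarity analysis)."""
    def dists(M):
        d = np.linalg.norm(M[:, None] - M[None, :], axis=-1)
        return d[np.triu_indices(len(M), k=1)]   # upper triangle only
    return np.corrcoef(dists(X), dists(Y))[0, 1]

# Sanity check: a rotated copy of a feature space should score ~1 on both.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 16))       # 50 paired features, 16-dim decoders
R = np.linalg.qr(rng.standard_normal((16, 16)))[0]
Y = X @ R                               # Y is a rotated copy of X
```

RSA only sees relative distances between features, so it is invariant to any rotation of the whole space; that is exactly the "relational" similarity the study targets.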

### Semantic Subspace Analysis
Explores changes in similarity across different semantic subspaces (e.g., differences in cross-model alignment between mathematical reasoning vs. sentiment analysis subspaces).
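A hedged sketch of how per-subspace scores can be computed: restrict the paired decoder rows to feature subsets and score each subset separately. The "math" and "sentiment" index sets below are hypothetical labels for illustration:

```python
import numpy as np

def rsa(X, Y):
    """Pearson correlation of the pairwise-distance structure of X and Y."""
    def dists(M):
        d = np.linalg.norm(M[:, None] - M[None, :], axis=-1)
        return d[np.triu_indices(len(M), k=1)]
    return np.corrcoef(dists(X), dists(Y))[0, 1]

rng = np.random.default_rng(1)
dec_a = rng.standard_normal((100, 32))                # model A decoder rows
dec_b = dec_a @ np.linalg.qr(rng.standard_normal((32, 32)))[0]
dec_b[50:] = rng.standard_normal((50, 32))            # second half: unrelated

math_idx = np.arange(50)       # hypothetical well-aligned subspace
sent_idx = np.arange(50, 100)  # hypothetical poorly-aligned subspace
aligned = rsa(dec_a[math_idx], dec_b[math_idx])
misaligned = rsa(dec_a[sent_idx], dec_b[sent_idx])
```

Scoring subsets this way makes the headline similarity number decomposable: a high global score could hide subspaces where the two models organize concepts quite differently.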

## Technical Implementation Highlights

### Interactive Feature Space Visualization
Provides the `pythia_feature_mapping_viz.py` script to generate self-contained HTML pages with two UMAP panels (corresponding to the SAE feature spaces of two models). The same batch of text is input into both models, features are mapped via batch activation correlation, decoder directions are dimensionality-reduced using UMAP, and users can hover/select to highlight corresponding features across panels.
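The core of that pipeline is projecting each model's decoder directions to 2-D, one panel per model. The sketch below uses PCA via SVD as a dependency-free stand-in for the UMAP step (the real script uses UMAP; everything else here is an illustrative assumption):

```python
import numpy as np

def project_2d(decoder):
    """Project (n_features, dim) decoder directions to 2-D via PCA,
    a lightweight stand-in for the UMAP reduction in the viz script."""
    Xc = decoder - decoder.mean(0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :2] * S[:2]            # one 2-D point per feature

rng = np.random.default_rng(0)
panel_a = project_2d(rng.standard_normal((200, 64)))  # model A's SAE
panel_b = project_2d(rng.standard_normal((200, 64)))  # model B's SAE
# Row i of panel_a / panel_b is one feature's point. The pairing found
# via activation correlation links a row in one panel to its match in
# the other, which is what hover/select highlighting draws on.
```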

### Experimental Framework and Reproducibility
Supports cross-layer/cross-scale comparisons of Pythia series models, with configuration of batch size, sequence length, number of random runs, and model layer range via command-line parameters.
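An argparse sketch of the kind of command-line interface described; the flag names and defaults are hypothetical illustrations, and the repository's actual options may differ:

```python
import argparse

# Hypothetical flags mirroring the parameters the framework exposes
# (batch size, sequence length, random runs, layer range).
parser = argparse.ArgumentParser(description="Cross-model SAE comparison")
parser.add_argument("--model-a", default="EleutherAI/pythia-70m")
parser.add_argument("--model-b", default="EleutherAI/pythia-160m")
parser.add_argument("--batch-size", type=int, default=32)
parser.add_argument("--max-seq-len", type=int, default=128)
parser.add_argument("--num-runs", type=int, default=5,
                    help="random runs for shuffled-pairing baselines")
parser.add_argument("--layers", default="0-5",
                    help="layer range to compare, e.g. 0-5")

args = parser.parse_args(["--batch-size", "16", "--layers", "2-4"])
lo, hi = map(int, args.layers.split("-"))   # layers to iterate over
```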

## Implications of Research Findings

### Contribution to Model Interpretability
Provides new tools for mechanistic interpretability, revealing that the internal concept organization of models may follow cross-model universal rules.

### Implications for Model Alignment and Safety
If feature spaces are geometrically similar across models, universal monitoring and intervention methods could be developed (e.g., safety-relevant feature patterns identified in one model could be transferred to others).

### Impact on Model Compression and Transfer
Feature space universality provides a theoretical basis for knowledge transfer, helping to design efficient transfer learning strategies and reduce training resources for new models.

## Technical Details and Usage Guide

The project codebase has a clear structure (main scripts, analysis notebooks, cloud auxiliary scripts, documentation), supports installation via conda or pip, and provides Windows configuration scripts. As a reproduction example, the feature spaces of the Pythia 70M and 160M models can be compared via the provided shell scripts, with custom batch size, maximum sequence length, and analysis layer range.

## Limitations and Future Directions

The current study focuses mainly on the Pythia model series; future work should expand to more architectures (e.g., Transformer variants, state space models). Activation-correlation pairing may also miss feature correspondences that are semantically related but have different activation patterns. The codebase is being refactored, with more complete documentation and a smoother user experience planned.

## Conclusion

This study is an important step toward understanding the internal world of LLMs. By quantifying feature space universality, it not only provides technical tools but also offers a new perspective: seemingly independent large models may collectively approach the deep truths of language and intelligence.
