Section 01
Introduction: Quantifying LLM Feature Space Universality Using Sparse Autoencoders
This study examines the geometric similarity of feature spaces across different large language models (LLMs). It decomposes each model's internal activations into interpretable feature sets using sparse autoencoders (SAEs), pairs features across models by activation correlation, and quantifies feature-space universality with representational similarity methods such as SVCCA (Singular Vector Canonical Correlation Analysis) and RSA (Representational Similarity Analysis). The aim is to determine whether models of different architectures and scales share common internal representational structure, offering new tools and perspectives for mechanistic interpretability, alignment and safety, and knowledge transfer.
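The analysis pipeline described above can be sketched on toy activation matrices. The functions below are illustrative implementations, not the study's actual code: `pair_features_by_correlation` matches each feature in one model to its most correlated feature in another, and `svcca` is a minimal QR-based SVCCA (SVD-reduce each matrix, then average the canonical correlations between the reduced representations).

```python
import numpy as np

def pair_features_by_correlation(A, B):
    """For each feature (column) of A, find the feature of B with the
    highest absolute Pearson correlation over the same inputs.
    A: (n_samples, d1), B: (n_samples, d2). Returns (indices, strengths)."""
    A = (A - A.mean(0)) / (A.std(0) + 1e-8)   # standardize columns
    B = (B - B.mean(0)) / (B.std(0) + 1e-8)
    corr = (A.T @ B) / A.shape[0]             # (d1, d2) correlation matrix
    return np.abs(corr).argmax(axis=1), np.abs(corr).max(axis=1)

def svcca(X, Y, keep=0.99):
    """SVCCA similarity in [0, 1]: truncate each matrix to the singular
    directions explaining `keep` of the variance, then return the mean
    canonical correlation between the two reduced representations."""
    def reduce(M):
        M = M - M.mean(0)
        U, s, _ = np.linalg.svd(M, full_matrices=False)
        k = int(np.searchsorted(np.cumsum(s**2) / np.sum(s**2), keep)) + 1
        return U[:, :k] * s[:k]
    Xr, Yr = reduce(X), reduce(Y)
    # Canonical correlations = singular values of Qx^T Qy,
    # where Qx, Qy are orthonormal bases of the reduced matrices.
    Qx, _ = np.linalg.qr(Xr)
    Qy, _ = np.linalg.qr(Yr)
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(np.clip(rho, 0.0, 1.0).mean())
```

In practice the rows would be SAE feature activations collected over a shared set of input tokens; here, any orthogonal rotation of the same data scores near 1.0, while independent random data scores much lower, which is the sense in which SVCCA measures shared geometry rather than identical coordinates.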