Zing Forum


Stripping Out Lexical Interference: AIPsy-Affect Provides a Clean Experimental Ground for Emotion Interpretability Research on Language Models

This article introduces AIPsy-Affect, a stimulus dataset of 480 keyword-free situational narratives. Its matched neutral control design helps researchers distinguish a language model's understanding of emotional concepts from its superficial recognition of emotional vocabulary.

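Mechanistic interpretability, sentiment analysis, language models, sparse autoencoders, activation patching, experimental design, AI safety, cognitive science, neural probes
Published 2026-04-26 22:03 · Recent activity 2026-04-28 10:25 · Estimated read 5 min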

Section 01

Stripping Out Lexical Interference: AIPsy-Affect Provides a Clean Experimental Ground for Emotion Interpretability of Language Models

This article introduces the AIPsy-Affect dataset, which contains 480 keyword-free situational narratives. Its matched neutral controls help researchers separate a language model's understanding of emotional concepts from surface-level recognition of emotion words, which addresses the methodological dilemmas in emotion interpretability research.


Section 02

Methodological Dilemmas in Emotion Interpretability Research and the Lexical Confound Problem

Current emotion research on language models commonly uses text stimuli that contain explicit emotion words. This introduces a confound: it is impossible to tell whether a model's activations reflect an understanding of the emotional concept or merely surface recognition of the vocabulary. Existing control conditions often just swap out the emotion words without keeping the situation constant, so they still fail to remove the lexical confound. The problem undermines the value of basic research and bears directly on AI safety, because conclusions drawn from flawed designs may lead to incorrect safety strategies.


Section 03

Core Design and Methodological Guarantees of the AIPsy-Affect Dataset

AIPsy-Affect includes 192 emotion-evoking scenarios (covering 8 basic emotions, with no direct emotional vocabulary) and 192 matched neutral controls (which keep structural elements such as characters and setting while removing the emotional content), along with intensity stratification and cross-emotion testing. Three NLP-based checks guard against lexical leakage: bag-of-words analysis shows no significant difference between emotional and control texts, emotion lexicons cannot tell them apart, and context classifiers can detect that emotion is present but cannot identify its category. Together these checks support the purity of the stimuli.
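The lexicon and bag-of-words checks described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tiny `EMOTION_WORDS` set stands in for a full emotion lexicon, the two sentences are invented rather than taken from the dataset, and a token-set Jaccard overlap is used as a simple proxy for the bag-of-words comparison. It is not the dataset authors' verification pipeline.

```python
import re

# Hypothetical mini-lexicon; a real check would load a full emotion lexicon.
EMOTION_WORDS = {"angry", "afraid", "sad", "happy", "joy", "fear",
                 "disgust", "furious", "scared"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def lexicon_hits(text, lexicon=EMOTION_WORDS):
    """Return the sorted lexicon words that occur in the text."""
    return sorted(set(tokenize(text)) & lexicon)

def jaccard(a, b):
    """Jaccard overlap between the token sets of two texts."""
    ta, tb = set(tokenize(a)), set(tokenize(b))
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0

# Invented example pair (not from AIPsy-Affect): same frame, different event.
emotional = "The letter said the clinic had lost her test results again."
neutral = "The letter said the clinic had moved her appointment to Tuesday."

# A keyword-free stimulus and its control should both yield no lexicon hits.
print("lexicon hits:", lexicon_hits(emotional), lexicon_hits(neutral))
print("token overlap:", round(jaccard(emotional, neutral), 2))
```

A stimulus that fails this kind of check (for example, one containing "furious") would let a model separate the conditions on vocabulary alone, which is exactly the confound the matched design is meant to rule out.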


Section 04

Application Scenarios of AIPsy-Affect

The dataset supports several kinds of interpretability study: linear probe analysis (testing for emotional representations across the model's layers), activation patching experiments (identifying the neurons or directions that carry emotion), sparse autoencoder feature analysis (finding features that encode emotional concepts), and causal ablation and steering vectors (establishing causal links between features and model behavior).
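As a rough illustration of how matched pairs feed into probe-style and steering-style analyses, the sketch below computes a difference-of-means direction between emotional and neutral activations and checks it on held-out pairs. It assumes synthetic activations generated in code rather than real model activations; the dimension, the hidden axis, the shift size, and all function names are hypothetical choices for the demo. Difference of means is one simple technique among many and is not claimed to be the method used with AIPsy-Affect.

```python
import random

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def diff_of_means(emotional, neutral):
    """Candidate emotion direction: mean(emotional) - mean(neutral)."""
    return [e - n for e, n in zip(mean_vector(emotional), mean_vector(neutral))]

def project(vec, direction):
    """Dot product of an activation vector with a direction."""
    return sum(v * d for v, d in zip(vec, direction))

def paired_accuracy(emotional, neutral, direction):
    """Share of matched pairs whose emotional item projects higher."""
    wins = sum(project(e, direction) > project(n, direction)
               for e, n in zip(emotional, neutral))
    return wins / len(emotional)

def make_pairs(n_pairs, dim, axis, shift, rng):
    """Synthetic stand-in for activations: each pair shares a scenario
    vector, and the emotional item is shifted along one hidden axis."""
    emotional, neutral = [], []
    for _ in range(n_pairs):
        base = [rng.gauss(0, 1) for _ in range(dim)]
        neutral.append([b + rng.gauss(0, 0.1) for b in base])
        emotional.append([b + rng.gauss(0, 0.1) + (shift if i == axis else 0.0)
                          for i, b in enumerate(base)])
    return emotional, neutral

rng = random.Random(0)
DIM, AXIS = 16, 3  # hypothetical sizes, chosen only for the demo
emo, neu = make_pairs(40, DIM, AXIS, shift=5.0, rng=rng)

# Fit the direction on 30 pairs, then evaluate on the 10 held-out pairs.
direction = diff_of_means(emo[:30], neu[:30])
accuracy = paired_accuracy(emo[30:], neu[30:], direction)
top_axis = max(range(DIM), key=lambda i: abs(direction[i]))
print(f"held-out paired accuracy: {accuracy:.2f}, strongest axis: {top_axis}")
```

Because each emotional item is compared only with its own control, the shared scenario content cancels out of the comparison. With real activations, the same direction could in principle be added to or subtracted from a layer's activations as a steering or ablation intervention, though that step needs model access and is not shown here.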


Section 05

Comparison and Extension of AIPsy-Affect with Previous Work

AIPsy-Affect is a four-fold expansion of the team's previous 96-stimulus dataset, which increases statistical power and supports cross-emotion comparisons. Compared with other emotion datasets, it stands out for its rigorous control design, which fills a methodological gap.


Section 06

Open Science and Community Value

AIPsy-Affect is released as open source under the MIT license. This promotes methodological standardization (the dataset can serve as a benchmark test set), lowers the barrier to research (there is no need to construct complex stimuli from scratch), and facilitates discovery (the large-scale design can reveal previously overlooked patterns).


Section 07

Conclusion: Towards a More Rigorous Science of Interpretability

AIPsy-Affect marks a step toward methodological maturity in AI interpretability research and underscores the importance of rigorous experimental design. It helps researchers strip away surface-level confounds and reach the underlying cognitive mechanisms, which makes it a necessary foundation for building trustworthy AI systems.