# From Grammar to Emotion: A Mechanistic Analysis of Emotional Reasoning in Large Language Models

> This paper systematically analyzes the emotion recognition mechanism of Large Language Models (LLMs) using Sparse Autoencoders (SAEs), discovers a three-stage information flow pattern, and proposes a causal feature guidance method that significantly improves emotion recognition performance while preserving language modeling capabilities.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-28T17:03:16.000Z
- Last activity: 2026-04-29T02:43:38.788Z
- Popularity: 132.3
- Keywords: mechanistic interpretability, sparse autoencoders, emotion recognition, causal tracing, feature guidance, LLM interpretability, human-computer interaction, emotional AI
- Page link: https://www.zingnex.cn/en/forum/thread/llm-arxiv-2604-25866v1
- Canonical: https://www.zingnex.cn/forum/thread/llm-arxiv-2604-25866v1
- Markdown source: floors_fallback

---

## Research Background: The Interpretability Gap in Emotional AI

Large Language Models (LLMs) are increasingly deployed in emotion-sensitive human-computer interaction scenarios—from mental health counseling assistants and customer service chatbots to educational tutoring systems. These applications require models not only to understand literal semantics but also to accurately capture and respond to human emotions. However, despite the critical importance of emotion recognition capabilities for the practical application of LLMs, we know little about their internal working mechanisms. How do models infer emotional states from pure text input? How does this capability emerge through the layer-by-layer computations of neural networks? The answers to these questions are essential for building safer, more controllable, and more trustworthy emotional AI systems.

## Methodology: Analytical Framework Using Sparse Autoencoders

This study uses Sparse Autoencoders (SAEs) as its main analytical tool. An SAE learns to decompose a network's activations into a sparse set of interpretable features, which makes the internal representations of a complex model easier to inspect. The research team built a four-step analysis pipeline:

1. Cross-layer activation tracking: record and analyze the sparse feature activation patterns at each layer of the model.
2. Staged information flow analysis: identify how emotion-related information moves through the depth of the model.
3. Causal tracing: intervene on specific features and quantify their contribution to the emotion prediction.
4. Feature manipulation: develop interpretable feature guidance methods based on the causal findings.
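The decomposition and causal tracing steps above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the ReLU encoder and linear decoder are a common SAE form assumed here, the weights `W_ENC` and `W_DEC` are hand-picked rather than trained, and `READOUT` is a hypothetical linear stand-in for the model's emotion logit. In a real setup the ablated reconstruction would be patched back into the model and the actual output measured.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_encode(x, w_enc, b_enc):
    """Sparse feature activations: f = ReLU(W_enc x + b_enc)."""
    pre = matvec(w_enc, x)
    return relu([p + b for p, b in zip(pre, b_enc)])


def sae_decode(f, w_dec, b_dec):
    """Reconstructed activation: x_hat = W_dec f + b_dec."""
    return [r + b for r, b in zip(matvec(w_dec, f), b_dec)]


def ablation_effects(f, w_dec, b_dec, readout):
    """Zero each feature in turn and report the drop in the readout score."""
    def score(features):
        x_hat = sae_decode(features, w_dec, b_dec)
        return sum(r * v for r, v in zip(readout, x_hat))

    base = score(f)
    effects = []
    for i in range(len(f)):
        ablated = list(f)
        ablated[i] = 0.0  # intervention: switch feature i off
        effects.append(base - score(ablated))
    return effects


# Toy, hand-picked parameters (assumptions): 2-dim activation, 3 SAE features.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
B_ENC = [0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]
B_DEC = [0.0, 0.0]
READOUT = [0.5, 2.0]  # hypothetical linear "emotion logit" probe

x = [2.0, 1.0]
features = sae_encode(x, W_ENC, B_ENC)
reconstruction = sae_decode(features, W_DEC, B_DEC)
effects = ablation_effects(features, W_DEC, B_DEC, READOUT)

print("features:", features)
print("reconstruction:", reconstruction)
print("ablation effects:", effects)
```

Features with a larger ablation effect are the ones a causal analysis would flag as more important for the prediction; inactive features have no effect when zeroed.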

## Key Finding 1: Three-Stage Information Flow Pattern

Through detailed analysis of sparse feature activations, the study reveals a consistent three-stage information processing pattern:

- Stage 1 (early layers): activation patterns mainly reflect basic syntactic and lexical processing, closely related to low-level language tasks such as part-of-speech tagging and syntactic parsing.
- Stage 2 (middle layers): higher-level semantic integration features begin to appear, with features related to tasks such as entity relations, coreference resolution, and semantic role labeling becoming active.
- Stage 3 (late layers): emotion-related features emerge prominently, indicating that an LLM's emotion understanding is built mainly on top of high-level semantic representations.

Engineering implication: model compression or adaptation for emotion tasks may be able to modify early layers without impairing core capabilities.
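One simple way to turn per-layer feature statistics into a stage profile like the one described above is to label each layer by its dominant feature category and merge contiguous layers with the same label. The sketch below assumes features have already been tagged as `syntactic`, `semantic`, or `emotion`; the counts in `layer_counts` are invented for illustration and are not results from the paper.

```python
from itertools import groupby

# Hypothetical counts of active SAE features per layer, by category.
layer_counts = [
    {"syntactic": 40, "semantic": 8, "emotion": 1},
    {"syntactic": 35, "semantic": 12, "emotion": 2},
    {"syntactic": 10, "semantic": 30, "emotion": 5},
    {"syntactic": 6, "semantic": 28, "emotion": 9},
    {"syntactic": 4, "semantic": 12, "emotion": 25},
    {"syntactic": 3, "semantic": 10, "emotion": 31},
]


def dominant_category(counts):
    """Return the category with the most active features in one layer."""
    return max(counts, key=counts.get)


def find_stages(per_layer_counts):
    """Group contiguous layers sharing a dominant category.

    Returns a list of (category, first_layer, last_layer) tuples.
    """
    labels = [dominant_category(c) for c in per_layer_counts]
    stages = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        stages.append((label, start, start + length - 1))
        start += length
    return stages


stages = find_stages(layer_counts)
for category, first, last in stages:
    print(f"layers {first}-{last}: {category}")
```

A dominant-category rule is deliberately crude; a real analysis would more likely look at how the full distribution shifts with depth.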

## Key Finding 2: Dual Structure of Emotional Representation

The study analyzes the internal structure of emotional representation and finds that it consists of two complementary components, plus one notable exception:

- Shared feature pool: a set of basic features is shared across emotion categories. These may encode general dimensions of emotion (such as valence and arousal) and provide an underlying framework for emotion recognition.
- Emotion-specific features: each emotion has its own subset of features that capture its distinct characteristics (e.g., "joy" is associated with positive vocabulary patterns, "anger" with semantic features related to conflict or frustration).
- The special case of disgust: the representation of disgust is more dispersed and weaker than that of other emotions, possibly reflecting the scarcity of disgust samples in training data or the fuzziness of the concept's boundaries.
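The split between a shared pool and emotion-specific features can be expressed with plain set operations. In this sketch, "shared" is taken to mean "present for every emotion" and "specific" to mean "present for exactly one emotion"; both definitions are assumptions, and the feature ids in `top_features` are made up for illustration.

```python
# Hypothetical ids of the most strongly activated SAE features per emotion.
top_features = {
    "joy": {3, 7, 12, 21},
    "anger": {3, 7, 15, 30},
    "sadness": {3, 7, 18},
    "disgust": {3, 7, 44},
}


def shared_pool(features_by_emotion):
    """Features that appear for every emotion."""
    return set.intersection(*features_by_emotion.values())


def specific_features(features_by_emotion):
    """For each emotion, the features that no other emotion uses."""
    result = {}
    for emotion, feats in features_by_emotion.items():
        others = set()
        for other, other_feats in features_by_emotion.items():
            if other != emotion:
                others |= other_feats
        result[emotion] = feats - others
    return result


shared = shared_pool(top_features)
specific = specific_features(top_features)

print("shared pool:", sorted(shared))
for emotion, feats in specific.items():
    print(f"{emotion}: {sorted(feats)}")
```

Features used by several but not all emotions fall into neither group under these strict definitions; a softer threshold (for example, "used by at least half of the emotions") would be an equally reasonable choice.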

## Key Contribution: Causal Feature Guidance Method

Based on these mechanistic interpretability insights, the research team developed a causal feature guidance method:

- Method design: the core idea is to selectively strengthen the activation of features that have a strong causal impact on emotion predictions, in order to improve model performance.
- Key advantages: interpretability (each intervention has a clear causal basis), data efficiency (no large labeled dataset is needed for fine-tuning), and capability preservation (the model's general language modeling capabilities are maintained).
- Experimental results: evaluated on multiple model architectures and emotion recognition datasets, the method is reported to significantly improve emotion recognition accuracy while preserving language modeling capability and generalizing robustly across datasets.
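The post does not spell out the exact form of the intervention. A common way to strengthen an SAE feature, assumed here, is to add a scaled copy of its decoder direction to the layer's activation. The sketch below illustrates that additive form on toy values; `DECODER_DIRECTIONS`, `READOUT`, and the strength `alpha` are all hypothetical.

```python
def steer(activation, decoder_directions, feature_ids, alpha):
    """Return a copy of `activation` shifted along the chosen feature directions.

    decoder_directions[i] is the activation-space direction of SAE feature i.
    """
    steered = list(activation)  # leave the original activation untouched
    for i in feature_ids:
        for j, component in enumerate(decoder_directions[i]):
            steered[j] += alpha * component
    return steered


def readout_score(activation, readout):
    """Hypothetical linear probe standing in for the model's emotion logit."""
    return sum(r * a for r, a in zip(readout, activation))


# Toy values (assumptions): 3 features in a 2-dim activation space.
DECODER_DIRECTIONS = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
READOUT = [0.5, 2.0]

activation = [2.0, 1.0]
steered = steer(activation, DECODER_DIRECTIONS, feature_ids=[1], alpha=0.5)

score_before = readout_score(activation, READOUT)
score_after = readout_score(steered, READOUT)

print("steered activation:", steered)
print("score before / after:", score_before, score_after)
```

In practice `alpha` trades off the strength of the effect against distortion of the rest of the model's behavior, so checking a language modeling metric alongside the emotion metric, as the post describes, is the natural safeguard.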
