Zing Forum

MoRFI: Using Sparse Autoencoders to Identify the 'Culprit' Behind Large Model Hallucinations

Large models tend to hallucinate when fine-tuned on new knowledge, but the underlying mechanism has long been unclear. The MoRFI method uses sparse autoencoders to analyze residual stream activations, identifies latent directions causally related to hallucinations, and can restore knowledge retrieval ability through single-dimensional intervention.

Tags: LLM hallucination · sparse autoencoders · interpretability · model editing · knowledge retrieval
Published 2026-04-30 00:32 · Recent activity 2026-04-30 10:33 · Estimated read 5 min

Section 01

MoRFI: A New Method to Locate the Neural Mechanism of Large Model Hallucinations

Core Idea: Large models are prone to hallucinations when fine-tuned on new knowledge, and the mechanism has long been unclear. The MoRFI method uses sparse autoencoders to analyze residual stream activations, identifies latent directions causally related to hallucinations, and can restore the model's knowledge retrieval ability through single-dimensional intervention, providing a new path to mitigate hallucinations.

Section 02

Deep Mysteries of Large Model Hallucinations and Research Motivation

The hallucination problem of Large Language Models (LLMs) is a core obstacle to their deployment, yet its neural mechanism remains unclear. LLMs acquire factual knowledge during pre-training, but post-training (e.g., SFT, RLHF) on new knowledge tends to trigger hallucinations. Existing studies show that SFT exacerbates hallucinations, but why it does so has not been explained. The MoRFI work aims to localize the internal representation changes behind hallucinations and to explore whether they are reversible and repairable without retraining.

Section 03

MoRFI Method and Experimental Design

Experimental Design: Three models (Llama3.1 8B, Gemma2 9B, and Mistral 7B v0.3) were fine-tuned on seven closed-book QA datasets. By varying the proportion of new knowledge in the fine-tuning mix and the number of training epochs, the authors found that the hallucination rate rises with both.
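The controlled variables in this sweep are the fraction of new (previously unknown) facts in the fine-tuning mix and the epoch count. A minimal sketch of how such a sweep could be assembled and scored, assuming hypothetical known/unknown QA pools and a simple exact-match hallucination check (the paper's exact grading protocol is not shown here):

```python
import random

def mix_training_set(known, unknown, p_new, n, seed=0):
    """Sample an n-example fine-tuning set in which a fraction p_new of the
    QA pairs comes from the 'unknown' (new-knowledge) pool."""
    rng = random.Random(seed)
    n_new = round(p_new * n)
    return rng.sample(unknown, n_new) + rng.sample(known, n - n_new)

def hallucination_rate(answers, references):
    """Fraction of model answers that fail a (hypothetical) case-insensitive
    exact-match check against the gold closed-book answer."""
    wrong = sum(a.strip().lower() != r.strip().lower()
                for a, r in zip(answers, references))
    return wrong / len(answers)
```

Fine-tuning the model on each mix and plotting `hallucination_rate` against `p_new` (and against epochs) would reproduce the kind of trend the study reports.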

Tools and Methods: A Sparse Autoencoder (SAE) decomposes residual-stream activations into sparse features; MoRFI then filters for SAE features whose activation changes monotonically with the proportion of new knowledge, identifying latent directions causally related to hallucinations.
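The filtering step can be sketched as: encode residual activations with a pretrained SAE, average each feature's activation at every new-knowledge proportion, and keep the features that move monotonically. This is a minimal numpy sketch assuming a standard ReLU SAE; the paper's exact selection criterion may differ:

```python
import numpy as np

def sae_encode(resid, W_enc, b_enc):
    """ReLU sparse-autoencoder encoder: residual-stream vectors of shape
    (batch, d_model) -> sparse feature activations (batch, n_features).
    The weights here are illustrative placeholders."""
    return np.maximum(resid @ W_enc + b_enc, 0.0)

def monotonic_features(mean_acts_by_ratio):
    """Indices of SAE features whose mean activation changes strictly
    monotonically as the new-knowledge proportion increases. Rows of
    mean_acts_by_ratio are ordered by increasing proportion."""
    diffs = np.diff(mean_acts_by_ratio, axis=0)
    rising = np.all(diffs > 0, axis=0)
    falling = np.all(diffs < 0, axis=0)
    return np.where(rising | falling)[0]
```

Strict monotonicity across every proportion step is a deliberately simple stand-in for whatever trend test MoRFI actually applies.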

Section 04

Causal Intervention Verification: Effect of Single-Dimensional Repair on Hallucinations

Intervening on a single latent variable (adjusting the activation of a specific SAE feature) along the directions identified by MoRFI can effectively restore the model's knowledge retrieval ability and reduce hallucinatory fabrications. The intervention works across different model architectures, including Llama, Gemma, and Mistral, suggesting it generalizes.
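A single-feature intervention of this kind amounts to shifting the residual stream along one decoder direction so that the chosen SAE feature takes a target value. The clamping form and weight names below are assumptions for illustration, not the paper's exact implementation:

```python
import numpy as np

def intervene_single_feature(resid, W_enc, b_enc, W_dec, j, value):
    """Clamp SAE feature j to `value` and write only that edit back into
    the residual stream: x' = x + (value - f_j(x)) * d_j, where d_j is
    feature j's decoder direction. resid has shape (batch, d_model)."""
    f = np.maximum(resid @ W_enc + b_enc, 0.0)  # SAE feature activations
    delta = value - f[:, j]                     # required shift on feature j
    return resid + np.outer(delta, W_dec[j])    # move along decoder direction
```

Because only one feature's contribution is edited, the rest of the residual stream is left untouched, which is what makes the repair "single-dimensional" and cheap relative to retraining.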

Section 05

Significance and Implications of the MoRFI Study

  1. Provides a mechanistic explanation for hallucinations: Reveals that abnormal activation of specific latent directions is the neural basis of hallucinations;
  2. Demonstrates the potential of SAE in LLM interpretability, which can be extended to AI safety issues such as bias and toxicity;
  3. Suggests lightweight model editing solutions: Local intervention on specific features is more efficient than retraining.

Section 06

Limitations and Future Research Directions

Limitations: The experiments focus on closed-book QA tasks, with a single form of hallucination; the identified features are correlational rather than strictly causal; there is a lack of systematic repair strategies.

Future Directions: Extend to tasks such as open-ended generation and multi-turn dialogue; strengthen causal inference verification; explore repair solutions such as joint adjustment of multiple features and dynamic intervention thresholds.