# MoRFI: Using Sparse Autoencoders to Identify the 'Culprit' Behind Large Model Hallucinations

> Large models tend to hallucinate when fine-tuned on new knowledge, but the underlying mechanism has long been unclear. The MoRFI method uses sparse autoencoders to analyze residual stream activations, identifies latent directions causally related to hallucinations, and can restore knowledge retrieval ability through single-dimensional intervention.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-29T16:32:57.000Z
- Last activity: 2026-04-30T02:33:09.009Z
- Popularity: 128.0
- Keywords: large language models, hallucination, sparse autoencoders, interpretability, model editing, knowledge retrieval
- Page link: https://www.zingnex.cn/en/forum/thread/morfi
- Canonical: https://www.zingnex.cn/forum/thread/morfi
- Markdown source: floors_fallback

---

## The Unresolved Mechanism of Hallucinations and Research Motivation

The hallucination problem of Large Language Models (LLMs) is a core obstacle to their deployment, yet its neural mechanism remains unclear. LLMs acquire factual knowledge during pre-training, but post-training on new knowledge (e.g., SFT, RLHF) tends to trigger hallucinations; prior work shows that SFT exacerbates hallucinations without explaining how. MoRFI aims to locate the internal representation changes underlying hallucinations and to explore whether they are reversible and repairable without retraining.

## MoRFI Method and Experimental Design

**Experimental Design**: Three models, Llama 3.1 8B, Gemma 2 9B, and Mistral 7B v0.3, were fine-tuned on 7 closed-book QA datasets. By varying both the proportion of new knowledge in the fine-tuning data and the number of training epochs, the study found that the hallucination rate rises with each.
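As a concrete illustration, the sketch below shows one way such controlled fine-tuning splits could be assembled. The `is_known` labels, the probing step that would produce them, and all function names are assumptions for illustration, not the paper's exact procedure.

```python
import random

def build_finetune_split(qa_pairs, is_known, new_ratio, n_total, seed=0):
    """Mix 'known' and 'new' QA pairs at a controlled new-knowledge ratio.

    qa_pairs: list of (question, answer) tuples.
    is_known: parallel booleans, e.g., whether the base model already
              answers the question correctly closed-book (an assumption
              about how 'new knowledge' would be labeled).
    """
    rng = random.Random(seed)
    known = [p for p, k in zip(qa_pairs, is_known) if k]
    new = [p for p, k in zip(qa_pairs, is_known) if not k]
    n_new = round(new_ratio * n_total)
    split = rng.sample(new, n_new) + rng.sample(known, n_total - n_new)
    rng.shuffle(split)  # avoid ordering effects during fine-tuning
    return split

# Example: sweep the new-knowledge proportion for the hallucination study
# splits = {r: build_finetune_split(qa_pairs, is_known, r, 2000)
#           for r in (0.0, 0.25, 0.5, 1.0)}
```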

**Tools and Methods**: Sparse autoencoders (SAEs) decompose residual-stream activations into sparse, interpretable features. MoRFI then filters for SAE features whose activation changes monotonically with the proportion of new knowledge, identifying latent directions causally related to hallucinations.
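A minimal sketch of the filtering step is given below, assuming a trained SAE with encoder weights `W_enc`/`b_enc` and using a Spearman-correlation threshold as the monotonicity test; the post does not specify MoRFI's actual selection criterion, so `min_rho` and the rest are illustrative assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

def sae_encode(h, W_enc, b_enc):
    """Project residual-stream activations h (n, d_model) into sparse
    feature space (n, n_features) with a ReLU, the usual SAE encoder."""
    return np.maximum(0.0, h @ W_enc + b_enc)

def monotonic_features(acts_by_ratio, ratios, min_rho=0.9):
    """Keep SAE features whose mean activation moves monotonically with
    the proportion of new knowledge in the fine-tuning data.

    acts_by_ratio: one (n_samples, n_features) array per value in `ratios`.
    """
    mean_acts = np.stack([a.mean(axis=0) for a in acts_by_ratio])  # (n_ratios, n_features)
    keep = []
    for j in range(mean_acts.shape[1]):
        rho, _ = spearmanr(ratios, mean_acts[:, j])
        if abs(rho) >= min_rho:  # near-monotone trend across ratios
            keep.append((j, rho))
    return keep
```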

## Causal Intervention Verification: Effect of Single-Dimensional Repair on Hallucinations

Intervening on a single latent variable (adjusting the activation value of a specific SAE feature) along the directions identified by MoRFI effectively restores the model's knowledge retrieval ability and reduces hallucinatory fabrications. The intervention works across architectures, including Llama, Gemma, and Mistral, suggesting it generalizes.
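One way to implement such an intervention in PyTorch is a forward hook that decodes the residual stream through the SAE, clamps the identified feature, and re-injects everything else unchanged. The `sae.encode`/`sae.decode` interface, `LAYER_IDX`, `FEATURE_IDX`, and the choice of clamping value are all assumptions; the post does not specify the exact intervention.

```python
def make_sae_patch_hook(sae, feature_idx, target_value=0.0):
    """Forward hook that clamps one SAE feature in a layer's residual stream."""
    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output
        feats = sae.encode(resid)               # sparse feature activations
        err = resid - sae.decode(feats)         # SAE reconstruction error, preserved
        feats[..., feature_idx] = target_value  # the single-dimensional intervention
        patched = sae.decode(feats) + err       # everything but this feature unchanged
        return (patched,) + output[1:] if isinstance(output, tuple) else patched
    return hook

# Usage sketch (hypothetical layer/feature indices on a Llama-style model):
# layer = model.model.layers[LAYER_IDX]
# handle = layer.register_forward_hook(make_sae_patch_hook(sae, FEATURE_IDX))
# ...generate and evaluate closed-book QA accuracy...
# handle.remove()
```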

## Significance and Implications of the MoRFI Study

1. It provides a mechanistic explanation for hallucinations, revealing that abnormal activation of specific latent directions is their neural basis;
2. It demonstrates the potential of SAEs for LLM interpretability, which could extend to AI safety issues such as bias and toxicity;
3. It suggests lightweight model editing: local intervention on specific features is more efficient than retraining.

## Limitations and Future Research Directions

**Limitations**: The experiments focus on closed-book QA, covering only a single form of hallucination; the identified features are correlational rather than strictly causal; and a systematic repair strategy is still lacking.

**Future Directions**: Extend to tasks such as open-ended generation and multi-turn dialogue; strengthen causal verification of the identified features; and explore repair schemes such as jointly adjusting multiple features and dynamic intervention thresholds.
