Zing Forum

The Impact of Sampling Temperature on Hallucination in RAG Systems: A Systematic Empirical Study

This undergraduate thesis examines how the sampling temperature parameter affects hallucination in large language models (LLMs) used within Retrieval-Augmented Generation (RAG) systems. It contributes an experimental framework, evaluation scripts, and statistical analysis as empirical evidence on the factual reliability of LLMs.

Tags: RAG, Hallucination, Sampling Temperature, LLM, Research, Meta-Llama, Evaluation, Retrieval-Augmented Generation, Academic Study, Reproducibility
Published 2026-03-28 15:44 | Recent activity 2026-03-28 15:53 | Estimated read 6 min

Section 01

Introduction: A Systematic Empirical Study on the Impact of Sampling Temperature on Hallucinations in RAG Systems

This study examines how the sampling temperature parameter affects hallucination in large language models (LLMs) within Retrieval-Augmented Generation (RAG) systems. It builds a complete experimental framework and runs an empirical analysis with the Meta-Llama-3.1-8B-Instruct model, aiming to provide data for understanding the factual reliability of LLMs and for tuning model configurations in production environments. The work covers data preparation, the RAG pipeline, evaluation scripts, and statistical analysis, with an emphasis on reproducibility and practical use.

Section 02

Research Background: The Hallucination Dilemma of RAG Technology and the Key Role of Sampling Temperature

Retrieval-Augmented Generation (RAG) was originally regarded as an effective way to mitigate LLM hallucinations, yet hallucinations persist in practical applications. Sampling temperature is a key parameter controlling the randomness of model outputs: low temperatures yield near-deterministic outputs, while high temperatures increase diversity but may drift away from the facts. Understanding how temperature affects hallucination in RAG therefore has practical value for optimizing model configurations.
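To make the role of temperature concrete, here is a minimal sketch of the standard temperature-scaled softmax, in which logits are divided by the temperature before normalization. The function name and the three logit values are illustrative assumptions, not taken from the thesis code.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens (illustrative values only).
logits = [2.0, 1.0, 0.5]
low = softmax_with_temperature(logits, 0.1)   # mass concentrates on the top token
high = softmax_with_temperature(logits, 1.5)  # mass spreads across the tokens
```

At a low temperature almost all probability goes to the highest-scoring token, which is why sampling becomes close to deterministic. At a higher temperature, lower-scoring tokens are sampled more often.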

Section 03

Research Framework and Core Hypotheses: Exploring the Relationship Between Temperature and Hallucinations

Core research question: how does sampling temperature affect the frequency and severity of hallucinations in RAG systems? Three hypotheses are proposed on theoretical grounds:

1. Temperature is positively correlated with hallucination rate.
2. There is an optimal temperature range that balances creativity and factuality.
3. Different types of hallucination differ in their sensitivity to temperature.

The study uses the Meta-Llama-3.1-8B-Instruct model, deployed locally to support reproducibility.

Section 04

Experimental Design: Scientific and Rigorous Methodology and Evaluation System

The experimental design has four parts:

1. Dataset: a test set of 500 questions covering different difficulty levels and types.
2. RAG pipeline: document corpus, retrieval component, context assembly, and generation component.
3. Temperature settings: a range from 0.1 to above 1.5.
4. Evaluation: hallucination detection, factual accuracy, and answer relevance, with statistical significance analysis of the results.
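A design like this boils down to a sweep over temperatures and questions. The sketch below shows one possible shape for that loop. The temperature grid, the `run_sweep` function, and both stub callables are assumptions made for illustration: the thesis does not list its exact grid, and the stubs stand in for the real generation and hallucination-detection components.

```python
# Assumed grid; the study only states a range from 0.1 to above 1.5.
TEMPERATURES = [0.1, 0.5, 1.0, 1.5]

def run_sweep(questions, generate, is_hallucination, temperatures=TEMPERATURES):
    """Return {temperature: hallucination rate} over all questions."""
    rates = {}
    for t in temperatures:
        flagged = 0
        for q in questions:
            answer = generate(q["question"], q["context"], t)
            if is_hallucination(answer, q["context"]):
                flagged += 1
        rates[t] = flagged / len(questions)
    return rates

def stub_generate(question, context, temperature):
    # Placeholder for the LLM call: echoes the context below 1.0 and returns
    # an unsupported string otherwise. It only exercises the loop and says
    # nothing about real model behavior.
    return context if temperature < 1.0 else "unsupported claim"

def stub_is_hallucination(answer, context):
    # Placeholder detector: flags any answer not contained in the context.
    return answer not in context

questions = [
    {"question": "Q1", "context": "Paris is the capital of France."},
    {"question": "Q2", "context": "Water boils at 100 degrees Celsius at sea level."},
]
rates = run_sweep(questions, stub_generate, stub_is_hallucination)
```

Passing the generator and detector in as callables keeps the sweep independent of any particular model runtime or metric, so either can be swapped without touching the loop.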

Section 05

Technical Implementation and Reproducibility: Open-Source Framework and Statistical Analysis

The implementation uses the Q4_K_M quantized version of Meta-Llama-3.1-8B-Instruct, chosen to balance performance and efficiency. An automated evaluation pipeline supports batch experiments, metric calculation, and visualization. Linear regression, analysis of variance, and related methods are used to quantify the relationship between temperature and hallucination rate. The code repository is clearly structured, which makes the experiments easy to reproduce.
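As a rough illustration of the two statistical tools named here, the sketch below fits an ordinary least-squares line and computes a one-way ANOVA F statistic in plain Python. The helper names are hypothetical, and all numbers are made-up placeholders for exercising the code, not results from the study. The thesis may well use a statistics library instead.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of observations."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Placeholder data only: temperatures and invented hallucination rates.
temps = [0.5, 1.0, 1.5]
mean_rates = [0.1, 0.2, 0.3]
slope, intercept = fit_line(temps, mean_rates)

# Placeholder per-run rates, one group per temperature setting.
runs_by_temp = [[0.1, 0.2, 0.3], [0.2, 0.3, 0.4], [0.5, 0.6, 0.7]]
f_stat = anova_f(runs_by_temp)
```

The slope estimates how much the hallucination rate changes per unit of temperature. The F statistic compares variation between temperature groups with variation within them, and it would still need to be checked against an F distribution to obtain a p-value.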

Section 06

Research Significance: Providing Empirical Basis for RAG System Configuration Optimization

The study has three practical implications:

1. Configuration optimization: if the positive correlation between temperature and hallucination is confirmed, lower temperatures (0.3-0.5) can be used in production environments to favor factual accuracy.
2. Trade-off awareness: developers are reminded to balance creativity against factuality.
3. Evaluation standards: the study provides a multi-dimensional evaluation template that focuses on the fidelity of generated content to the retrieved sources.
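One very crude way to think about fidelity to retrieved sources is lexical overlap: what fraction of the answer's words also appear in the retrieved context. The sketch below is only such a proxy, offered as an assumption for illustration. It is not the metric used in the thesis, and the name `support_ratio` is hypothetical.

```python
import re

def support_ratio(answer, context):
    """Fraction of answer word tokens that also appear in the retrieved context."""
    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    answer_tokens = tokenize(answer)
    if not answer_tokens:
        return 0.0
    context_vocab = set(tokenize(context))
    supported = sum(1 for tok in answer_tokens if tok in context_vocab)
    return supported / len(answer_tokens)

# Illustrative strings only.
context = "The Eiffel Tower is in Paris."
faithful = support_ratio("The Eiffel Tower is in Paris", context)
unfaithful = support_ratio("The Eiffel Tower is in Rome", context)
```

Word overlap misses paraphrases and cannot detect a wrong claim built entirely from context words, so an evaluation in practice would likely pair a signal like this with stronger checks or manual review.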

Section 07

Limitations and Future Directions: Possible Paths for Extended Research

Limitations: the model is small (8B), only Llama 3.1 is tested, the study focuses on specific task types, and quantization may introduce information loss. Future directions: multi-model comparison, larger datasets, exploration of different RAG configurations, and manual evaluation to validate the automatic metrics.