# Evidence Alignment Measurement: Evaluating the Fact Anchoring Capability of Large Language Models

> An open-source project that studies how the parameter scale of large language models (from 8B to 405B) affects fact anchoring capability, proposing the Evidence Alignment Score (EAS) as a hybrid evaluation metric.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-15T16:09:24.000Z
- Last activity: 2026-04-15T16:21:00.640Z
- Popularity: 150.8
- Keywords: evidence alignment, large language models, hallucination, FEVER benchmark, NLI entailment, semantic similarity, fact anchoring, model evaluation
- Page link: https://www.zingnex.cn/en/forum/thread/llm-github-nipun2411-evidence-misligment-large-language-models
- Canonical: https://www.zingnex.cn/forum/thread/llm-github-nipun2411-evidence-misligment-large-language-models

---

## Introduction: The Evidence Misalignment Project, an Open-Source Framework for Quantifying LLM Fact Anchoring Capability

This open-source project addresses the hallucination problem in large language models (LLMs) by studying how parameter scale (from 8B to 405B) affects fact anchoring capability. It proposes the Evidence Alignment Score (EAS), a hybrid evaluation metric, and supports multiple model backends (local Ollama; cloud-based NVIDIA NIM and OpenAI). Built on the FEVER benchmark dataset and a controlled evaluation procedure, it provides a systematic, reproducible framework for evaluating LLM factuality.

## Background: The "Hallucination" Dilemma of LLMs and Research Questions

Hallucination, where an LLM generates content that is inconsistent with the facts, is a key challenge in AI. As models grow from 8 billion to 405 billion parameters, the core question is whether larger models are better at aligning generated content with evidence. The open-source Evidence Misalignment project on GitHub provides a systematic evaluation framework for answering that question.

## Methodology: Design of the Evidence Alignment Score (EAS)

EAS is a hybrid metric that quantifies the alignment between LLM-generated claims and evidence, composed of two weighted components:

### Semantic Similarity (weight α=0.35)
- Model: all-MiniLM-L6-v2
- Metric: Cosine similarity between the embeddings of the claim and evidence
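
To make the metric concrete, here is a minimal dependency-free sketch of cosine similarity between two embedding vectors. The function name `cosine_similarity` and the zero-vector handling are assumptions for illustration; the project computes embeddings with all-MiniLM-L6-v2, which this sketch does not load.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is treated as having no similarity.
        return 0.0
    return dot / (norm_a * norm_b)
```

Cosine similarity ranges from -1 to 1. The source does not say whether the project clips or rescales negative values before combining them into EAS, so that step is left out here.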

### NLI Entailment (weight β=0.65)
- Model: cross-encoder/nli-deberta-v3-base
- Metric: Probability that the evidence entails the claim
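
An NLI cross-encoder typically emits one raw score (logit) per class, and a softmax turns those into probabilities. The sketch below shows that step only. The function name and the default `entailment_index=1` are assumptions: the class order should be read from the model's own label mapping rather than taken from this example.

```python
import math
from typing import Sequence


def entailment_probability(logits: Sequence[float], entailment_index: int = 1) -> float:
    """Softmax over class logits, returning the entailment probability.

    entailment_index is an assumption; verify it against the model's labels.
    """
    peak = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(z - peak) for z in logits]
    return exps[entailment_index] / sum(exps)
```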

Formula: `EAS = α × semantic_score + β × entailment_score`

Alignment Levels:
- Aligned: EAS ≥ 0.70
- Partially Aligned: 0.40 ≤ EAS < 0.70
- Misaligned: EAS < 0.40
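
The formula and thresholds above can be sketched in a few lines. The function names are hypothetical; the weights and cutoffs come from this section.

```python
ALPHA = 0.35  # weight of the semantic similarity component
BETA = 0.65   # weight of the NLI entailment component


def evidence_alignment_score(
    semantic_score: float,
    entailment_score: float,
    alpha: float = ALPHA,
    beta: float = BETA,
) -> float:
    """EAS = alpha * semantic_score + beta * entailment_score."""
    return alpha * semantic_score + beta * entailment_score


def alignment_level(eas: float) -> str:
    """Map an EAS value to the three alignment levels."""
    if eas >= 0.70:
        return "Aligned"
    if eas >= 0.40:
        return "Partially Aligned"
    return "Misaligned"
```

As a worked example, a claim with semantic similarity 0.8 and entailment probability 0.9 scores 0.35 × 0.8 + 0.65 × 0.9 = 0.865, which falls in the Aligned band. Because β is larger than α, a claim that is lexically close to the evidence but not entailed by it (say 0.9 and 0.1) scores only 0.38 and is classed as Misaligned.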

## Evaluation Dataset: FEVER Benchmark and Balanced Sampling

The evaluation uses FEVER (Fact Extraction and VERification), a widely used fact-checking dataset that includes claims labeled SUPPORTS and REFUTES. To avoid class bias, sampling is balanced: SUPPORTS and REFUTES claims are drawn in equal numbers.
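
A balanced, reproducible draw can be sketched as follows. The record layout (a dict with a `"label"` key) and the function name `balanced_sample` are assumptions for illustration; the defaults of 300 samples and seed 42 are the values this write-up reports for the evaluation runs.

```python
import random
from typing import Dict, List


def balanced_sample(
    records: List[Dict[str, str]],
    n_total: int = 300,
    seed: int = 42,
) -> List[Dict[str, str]]:
    """Draw n_total records, half SUPPORTS and half REFUTES, deterministically."""
    if n_total % 2 != 0:
        raise ValueError("n_total must be even for a balanced sample")
    per_class = n_total // 2
    supports = [r for r in records if r["label"] == "SUPPORTS"]
    refutes = [r for r in records if r["label"] == "REFUTES"]
    if len(supports) < per_class or len(refutes) < per_class:
        raise ValueError("not enough records in one of the classes")
    rng = random.Random(seed)  # local generator, so global state is untouched
    picked = rng.sample(supports, per_class) + rng.sample(refutes, per_class)
    rng.shuffle(picked)  # avoid presenting all SUPPORTS claims first
    return picked
```

Using a local `random.Random(seed)` rather than seeding the global generator means that, given the same input records in the same order, every model is evaluated on the same claims regardless of what other code does with randomness.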

## Evaluation Implementation: Multi-Backend Support and Rigorous Processes

### Multi-Backend Support
| Backend | Trigger Condition | Example Models |
|---------|-------------------|----------------|
| Ollama (Local) | Name does not contain "/" | llama3, mistral, llama3.1:8b, qwen2:7b |
| NVIDIA NIM (Cloud) | Name contains "/" | meta/llama-3.1-8b-instruct, meta/llama-3.1-405b-instruct |
| OpenAI (Cloud) | Name starts with "gpt" | gpt-4o, gpt-4o-mini |
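
The routing rules in the table can be sketched as a single function. Note that a name such as `gpt-4o` also contains no "/", so the order of the checks matters; testing the `gpt` prefix first is an assumption here, as are the function name and the returned backend identifiers.

```python
def select_backend(model_name: str) -> str:
    """Pick a backend from the model name, following the table's rules."""
    name = model_name.strip()
    # Assumption: the OpenAI rule is checked first, because "gpt-4o"
    # would otherwise also match the Ollama rule (no "/" in the name).
    if name.lower().startswith("gpt"):
        return "openai"
    if "/" in name:
        return "nvidia_nim"
    return "ollama"
```

For example, `select_backend("meta/llama-3.1-405b-instruct")` routes to NVIDIA NIM, while `select_backend("llama3.1:8b")` routes to the local Ollama backend.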

### Evaluation Model Scales
The evaluated models span 8B to 405B parameters: Llama 3.1 8B/70B/405B, Mixtral 8x7B, and the GPT-4o series.

### Rigorous Processes
- Fixed random seed (seed=42): every model is evaluated on the same 300 samples
- Rate limiting: requests to NVIDIA NIM are throttled by 2 seconds and retried with exponential backoff
- Local scoring: the EAS calculation (semantic + NLI) always runs locally, so scores are consistent across backends
- Directory compatibility: colons in model labels are replaced with hyphens (e.g., llama3.1:8b → llama3.1-8b)
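
Two of these safeguards are easy to illustrate. The sketch below shows the label-to-directory conversion and a generic retry loop with exponential backoff. All names are hypothetical, and the retry count, the doubling factor, and catching every exception are assumptions; the source states only that a 2-second throttle and exponential backoff are used for NVIDIA NIM.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def label_to_dirname(model_label: str) -> str:
    """Replace colons so the label is safe as a directory name."""
    return model_label.replace(":", "-")


def call_with_backoff(
    request: Callable[[], T],
    max_retries: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request(), waiting base_delay * 2**attempt after each failure."""
    for attempt in range(max_retries + 1):
        try:
            return request()
        except Exception:
            # Assumption: any exception is retried. A real client would
            # likely retry only on rate-limit or transient network errors.
            if attempt == max_retries:
                raise
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError("max_retries must be non-negative")
```

Passing `sleep` as a parameter keeps the function testable without real waiting. With the defaults, failed attempts are followed by waits of 2, 4, 8, and 16 seconds before the final error is raised.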

## Application Scenarios and Research Insights

### Application Scenarios
- Model selection: Evaluate fact anchoring capability to assist decision-making
- RAG system optimization: Assess alignment between generator and retrieved evidence
- Hallucination detection: Automatically identify factual errors
- Academic research: Standardized evaluation tool

### Research Insights
- Re-examining scaling laws: EAS gives a way to quantify the relationship between model scale and alignment capability
- Need for multi-dimensional evaluation: a single metric is insufficient; combining semantic similarity with NLI entailment gives a more comprehensive picture
- Reproducibility engineering: fixed seeds and balanced sampling are good practices for rigorous evaluation

## Technical Highlights and Conclusion

### Technical Highlights
Core components of the modular architecture:
- data_loader.py: dataset loading
- claim_segmenter.py: sentence segmentation
- evidence_retriever.py: evidence extraction
- semantic_scorer.py: cosine similarity
- nli_scorer.py: NLI scoring
- eas_calculator.py: EAS calculation
- llm_client.py: multi-backend client

### Conclusion
This project provides a systematic, reproducible evaluation framework for LLM fact anchoring capability. The EAS metric quantifies the degree of alignment between generated claims and evidence, which makes it possible to study the relationship between model scale and factuality. It is useful to developers and researchers working on AI credibility, hallucination, and RAG optimization.
