Aegis: An LLM Intelligent Routing and Hallucination Detection Gateway Based on Causal Inference

Aegis is a production-grade LLM gateway that uses a complexity classifier to route each prompt to the most cost-effective model and uses causal inference to detect hallucinations without ground truth labels. It combines semantic caching, multi-level risk detection, and real-time cost monitoring to provide safe, cost-effective LLM access for high-stakes scenarios.

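Tags: LLM gateway, causal inference, hallucination detection, intelligent routing, cost optimization, semantic caching, production system, DoWhy, security gateway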

Published 2026-04-10 04:39 | Recent activity 2026-04-10 04:51 | Estimated read 8 min

Section 02

Dual Challenges of LLM Applications in Production Environments

When enterprises run large language models in production, they face two intertwined challenges. The first is wasted spend: simple queries are often sent to high-end models such as GPT-4o when Llama 3.1 (free when run locally) or Gemini Flash ($0.075 per 1M tokens) can answer them equally well. The second is silent hallucination: LLMs produce confident, fluent, but incorrect answers, which can have serious consequences in high-stakes domains such as healthcare, law, and finance.

Aegis is designed to address both problems. It is an end-to-end production system that implements intelligent routing and, more importantly, uses causal inference to detect hallucinations without requiring ground truth labels as a reference.

Section 03

Core Architecture: More Than Just Routing

Routing solutions such as OpenRouter and LiteLLM are already mature, and by 2026 routing has become a commodity feature. What differentiates Aegis is its causal hallucination detection mechanism.

Traditional fact-checking needs a known correct answer to judge whether a model's output is accurate, but production environments rarely have such a reference. Aegis instead asks a causal question: if only the wording of the question changes, does the factual claim change?

If the model gives different answers to different phrasings of the same question, that is a causal signal: the claim depends on surface features of the prompt rather than on knowledge. In causal inference terms, rephrasing the question is a do(X) intervention on the prompt's wording. The method requires no labels, ground truth answers, or external knowledge bases.

Section 04

Five-Level Routing and Cost Optimization

Aegis implements a five-level model routing based on complexity scoring:

Level	Model	Cost per 1M tokens	Applicable Scenarios
Free	Llama 3.1 8B (local Ollama)	$0.00	Simple fact queries, conversations
Economic	Gemini 1.5 Flash	$0.075	Low-medium complexity
Standard	GPT-4o-mini	$0.150	Medium complexity
Premium	Claude 3.5 Haiku	$0.250	Medium-high complexity, requires detailed reasoning
High-end	GPT-4o	$2.500	Complex reasoning, high-risk scenarios

The complexity classifier computes a four-factor weighted score: semantic embedding norm (30%), text structure score (25%), question type score (25%), and domain keyword density (20%). The score ranges from 0.0 to 1.0 and selects the cheapest tier judged capable of handling the prompt.
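
The weighted sum can be sketched as follows. This is a minimal illustration, assuming each of the four factors has already been normalized to the range 0 to 1; the field names and the `complexity_score` function are hypothetical, and only the weights are taken from the description above.

```python
from dataclasses import dataclass

# Weights as stated in the text; the key names are illustrative assumptions.
WEIGHTS = {
    "embedding_norm": 0.30,
    "structure": 0.25,
    "question_type": 0.25,
    "keyword_density": 0.20,
}


@dataclass
class Factors:
    """Four per-prompt factor scores, each assumed pre-normalized to [0, 1]."""
    embedding_norm: float
    structure: float
    question_type: float
    keyword_density: float


def complexity_score(f: Factors) -> float:
    """Return the weighted sum of the four factors, clamped to [0.0, 1.0]."""
    raw = sum(weight * getattr(f, name) for name, weight in WEIGHTS.items())
    return max(0.0, min(1.0, raw))


example = Factors(embedding_norm=0.4, structure=0.2, question_type=0.6, keyword_density=0.1)
print(round(complexity_score(example), 3))
```

How each factor is extracted and normalized is not described in the text, so the sketch takes the factor values as given.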

For the legal, medical, and financial domains, the system enforces a hard gate: regardless of the complexity score, the request is routed to GPT-4o, and the classifier cannot override this rule.
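
The routing decision, including the hard gate, might look like the sketch below. The model names follow the table above, but the score cutoffs between tiers are hypothetical, since the text does not say where each tier begins; the `route` function and the domain labels are likewise illustrative.

```python
from typing import Optional

# Model names come from the routing table. The upper-bound cutoffs are
# hypothetical placeholders; the source does not state the real boundaries.
TIERS = [
    (0.2, "Llama 3.1 8B"),
    (0.4, "Gemini 1.5 Flash"),
    (0.6, "GPT-4o-mini"),
    (0.8, "Claude 3.5 Haiku"),
    (1.0, "GPT-4o"),
]
HIGH_STAKES_DOMAINS = {"legal", "medical", "financial"}


def route(score: float, domain: Optional[str] = None) -> str:
    """Pick a model for a complexity score in [0, 1] and an optional domain label."""
    # Hard gate: checked first, so the classifier score cannot override it.
    if domain in HIGH_STAKES_DOMAINS:
        return "GPT-4o"
    for upper_bound, model in TIERS:
        if score <= upper_bound:
            return model
    return "GPT-4o"  # Fallback for out-of-range scores above 1.0.


print(route(0.1))
print(route(0.1, domain="medical"))
```

Checking the domain before the score is what makes the gate "hard": no score value can reach the tier lookup for a high-stakes query.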

Section 05

Level 1: Hedge Phrase Detection (Free, Runs on Every Request)

The system scans each response for 25 hedge phrases that signal weak confidence, such as "I'm not sure", "I think", "maybe", and "as far as I know". Finding 3 or more marks the response as a potential hallucination (medium risk). The check costs nothing and runs on every request, for every provider.
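
A minimal sketch of this check follows. It uses only the four phrases quoted above, because the full 25-phrase list is not given. It also assumes that repeated occurrences each count toward the threshold and that the non-flagged level is called "low"; neither detail is stated in the text.

```python
import re

# Only these four phrases are named in the text; the full list of 25 is not
# given, so this is a deliberately incomplete subset.
HEDGE_PHRASES = ["i'm not sure", "i think", "maybe", "as far as i know"]
HEDGE_THRESHOLD = 3  # "3 or more" from the text.


def count_hedges(response: str) -> int:
    """Count hedge-phrase occurrences, case-insensitively, on word boundaries."""
    text = response.lower().replace("\u2019", "'")  # Normalize curly apostrophes.
    return sum(
        len(re.findall(r"\b" + re.escape(phrase) + r"\b", text))
        for phrase in HEDGE_PHRASES
    )


def hedge_risk(response: str) -> str:
    """Return 'medium' when the hedge count reaches the threshold, else 'low'."""
    return "medium" if count_hedges(response) >= HEDGE_THRESHOLD else "low"


print(hedge_risk("I think it was 1990, but I'm not sure. Maybe 1991."))
print(hedge_risk("The capital of France is Paris."))
```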

Section 06

Level 3: Rewrite Variance Detection (Conditionally Triggered)

When the query belongs to the legal, medical, or financial domain, or the complexity score exceeds 0.7, the system triggers deep detection:

1. Use GPT-4o-mini to generate two different phrasings of the same question.
2. Send the three versions of the question (the original plus two rewrites) to the target model in parallel.
3. Compute the average cosine similarity between the embedding vectors of the three responses.
4. Compute variance = 1 - average similarity.

If the variance exceeds the threshold θ = 0.35, the response is marked as a high-risk hallucination. The threshold is not arbitrary: it is calibrated offline with the DoWhy library, and its causal validity is checked with a placebo treatment refutation test.
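
The variance computation can be sketched in a few lines. This sketch assumes that "average cosine similarity" means the mean over the three response pairs, and that the response embeddings are non-zero vectors already produced by an embedding model. The function names and the example vectors are made up for illustration, and the rewriting and model calls are left out.

```python
import math
from itertools import combinations
from typing import Sequence

THETA = 0.35  # Threshold stated in the text.


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rewrite_variance(embeddings: Sequence[Sequence[float]]) -> float:
    """Return 1 minus the mean pairwise cosine similarity of the embeddings."""
    pairs = list(combinations(embeddings, 2))
    mean_similarity = sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_similarity


def is_high_risk(embeddings: Sequence[Sequence[float]]) -> bool:
    """Flag the response when the variance exceeds the threshold."""
    return rewrite_variance(embeddings) > THETA


# Made-up 2-D vectors standing in for real response embeddings.
consistent = [[1.0, 0.1], [0.9, 0.1], [1.0, 0.2]]
divergent = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(is_high_risk(consistent), is_high_risk(divergent))
```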

Section 07

Risk Level Merging

The final risk level is the higher of the domain risk and the detection risk. The legal and medical domains are high risk by default and the financial domain is medium risk; rewrite variance detection raises the level to high, and hedge phrase detection raises it to medium.
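
Because the levels are ordered, the merge reduces to a `max` over two values. The sketch below assumes a three-level scale in which domains other than legal, medical, and financial start at low; the `Risk` enum and the `merge_risk` function are hypothetical names.

```python
from enum import IntEnum
from typing import Optional


class Risk(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2


# Domain baselines as described in the text; LOW for any other domain is an assumption.
DOMAIN_RISK = {"legal": Risk.HIGH, "medical": Risk.HIGH, "financial": Risk.MEDIUM}


def merge_risk(domain: Optional[str], hedge_flagged: bool, variance_flagged: bool) -> Risk:
    """Return the higher of the domain risk and the detection risk."""
    domain_risk = DOMAIN_RISK.get(domain, Risk.LOW)
    detection_risk = Risk.LOW
    if hedge_flagged:
        detection_risk = Risk.MEDIUM
    if variance_flagged:
        detection_risk = Risk.HIGH  # Variance detection outranks hedge detection.
    return max(domain_risk, detection_risk)


print(merge_risk("financial", hedge_flagged=False, variance_flagged=True).name)
```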

Section 08

Semantic Caching: Zero-Cost Hits

Aegis implements an in-memory semantic cache that embeds prompts with sentence-transformers/all-MiniLM-L6-v2. The similarity threshold is set to 0.85 rather than 0.95, because 0.95 yields a hit rate below 1% in practice. A cache hit returns in about 5 milliseconds at zero cost.

The cache and the hallucination detector share one embedding model instance, which avoids loading the roughly 90MB of model weights twice. The cache is cleared on server restart (a design choice for the demo environment).
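
A cache of this kind can be sketched as a linear scan over stored embeddings. The `SemanticCache` class below is a hypothetical illustration, not the project's code. The embedding function is injected so that one model instance can be shared with other components, and `toy_embed` (letter counts) is a stand-in that keeps the example self-contained. A real setup would pass in something like the `encode` method of a sentence-transformers model instead.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def _cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """In-memory cache that returns a stored response for sufficiently similar prompts."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.85) -> None:
        self.embed = embed  # Shared embedding function, injected by the caller.
        self.threshold = threshold
        self.entries: List[Tuple[Vector, str]] = []  # Lost on restart by design.

    def get(self, prompt: str) -> Optional[str]:
        """Return the best-matching cached response, or None below the threshold."""
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = _cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((self.embed(prompt), response))


def toy_embed(text: str) -> List[float]:
    """Stand-in embedding (letter counts), used only to make the demo runnable."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


cache = SemanticCache(toy_embed)
cache.put("What is the capital of France?", "Paris")
print(cache.get("what is the capital of france"))
print(cache.get("Explain quantum tunnelling"))
```

A linear scan is adequate for a small in-memory demo cache; a larger deployment would likely need an approximate nearest-neighbor index, which is outside what the text describes.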
