Zing Forum


Dual-System Architecture: A Large Language Model Enhancement Scheme Without Modifying Base Model Weights

This article offers an in-depth analysis of the Dual-System Architecture project—an innovative "geometric sidecar" design that enhances frozen large language models (LLMs) by adding trainable modules. It enables uncensored generation and structured mathematical reasoning while keeping the base model weights completely unchanged, supporting multi-user isolation and continuous learning.

Tags: LLM Sidecar Architecture · Uncensored Generation · Continuous Learning · Multi-User Isolation · KV Cache Compression · Geometric Processor · Fiber Bundle
Published 2026-04-01 07:11 · Recent activity 2026-04-01 07:48 · Estimated read: 7 min

Section 01

Introduction: Dual-System Architecture—A New Paradigm for LLM Enhancement Without Modifying the Base Model

The Dual-System Architecture introduced in this article is an innovative LLM enhancement scheme. Its core lies in adding a "geometric sidecar" module to achieve functions like uncensored generation, structured mathematical reasoning, multi-user isolation, and continuous learning without modifying the base model weights. This architecture treats the frozen base model as "System 1" (fast intuition) and the sidecar module as "System 2" (slow reasoning). It supports independent training iterations, avoids the risk of model degradation caused by traditional fine-tuning, and is compatible with various mainstream LLM architectures (e.g., Qwen2.5-3B, Llama-3.1-8B, etc.).


Section 02

Background: Pain Points of Traditional LLM Enhancement Schemes and the Proposal of Dual-System

Traditional LLM enhancement usually relies on fine-tuning or continued pre-training, approaches that carry high computational cost, are difficult to roll back, and risk damaging the model's original capabilities. The Dual-System Architecture instead proposes a "geometric sidecar" design that enhances a frozen base model by attaching trainable modules, addressing these pain points and offering a new path for expanding LLM capabilities.


Section 03

Methodology: Core Design and Technical Architecture of Dual-System

The core design of the Dual-System Architecture is the "System 1 + System 2" model: System 1 is the frozen base LLM, and System 2 is the geometric sidecar module. The sidecar module includes several key components:

  1. Diffusion Planner: Based on DDIM and adaptive layer normalization, it converts input tokens into high-dimensional latent planning representations;
  2. Geometric Processor: A 4-layer Transformer architecture that performs geometric transformations on latent representations;
  3. Fiber Bundle: A per-user personalization mechanism based on principal fiber bundle theory, ensuring user isolation (cross-user cos_sim ≥ 0.9999);
  4. EBM Judge: Based on an energy-based model, it identifies factual hallucinations and style mismatches;
  5. Cognitive Router: Routes gradients via the Kappa gating mechanism to mitigate catastrophic forgetting in continuous learning.
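To make the System 1 / System 2 split concrete, here is a minimal numpy sketch of the core idea: a frozen base layer plus an additive, trainable sidecar correction. The names (`base_forward`, `sidecar_correction`) and the residual-style update are illustrative assumptions, not the project's actual API.

```python
import numpy as np

D_MODEL = 64
rng = np.random.default_rng(0)

W_base = rng.standard_normal((D_MODEL, D_MODEL))  # base model weights
W_base.setflags(write=False)                      # System 1 stays frozen

def base_forward(h):
    """System 1: the frozen base layer (fast intuition)."""
    return np.tanh(h @ W_base)

def sidecar_correction(h, W_sidecar):
    """System 2: trainable sidecar applied as a residual correction."""
    return h + h @ W_sidecar  # only W_sidecar is ever trained

W_sidecar = np.zeros((D_MODEL, D_MODEL))  # zero-init => identity behavior
h = rng.standard_normal(D_MODEL)
out = sidecar_correction(base_forward(h), W_sidecar)

# With a zero-initialized sidecar the output equals the frozen base output,
# so attaching the sidecar cannot degrade System 1 before any training.
assert np.allclose(out, base_forward(h))
```

The zero-initialized residual form is one common way such adapters guarantee that the untrained module is a no-op, which matches the article's claim that the scheme avoids the degradation risk of fine-tuning.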

Section 04

Evidence: Experimental Validation of Key Capabilities and Performance Optimization

Validation of Key Capabilities:

  • Uncensored Generation: Using the FailSpy differential-mean method to extract and project out the "rejection direction", the rejection rate dropped from about 80% to 0%, while scores on 6 benchmarks (ARC-E, ARC-C, etc.) stayed within 0.3 percentage points of the baseline;
  • Multi-User Isolation: The cos_sim of output changes from per-user style correction was 0.997, and cross-user isolation cos_sim ≥0.9999;
  • Continuous Learning: The EBM Judge performs token-level credit assignment, the Cognitive Router automatically routes gradients, and the BCH integrator merges cross-session perturbations;
  • TurboKV Cache Compression: In 4-bit mode, memory usage was reduced by 3.9x (from 896MB to 232MB for 8K context).
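The differential-mean projection behind the uncensored-generation result can be sketched as follows: estimate the rejection direction as the difference between mean activations on refusal-triggering and benign prompts, then remove that component from each hidden state via orthogonal projection. The activation data below is synthetic; this is a sketch of the technique, not the project's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 128

# Ground-truth direction used only to generate synthetic activations.
refusal_dir = rng.standard_normal(d)
refusal_dir /= np.linalg.norm(refusal_dir)

# Synthetic hidden states: refusal-triggering prompts carry an extra
# component along the refusal direction.
benign = rng.standard_normal((200, d))
refusing = rng.standard_normal((200, d)) + 5.0 * refusal_dir

# Differential-mean estimate of the rejection direction.
est = refusing.mean(axis=0) - benign.mean(axis=0)
est /= np.linalg.norm(est)

def ablate(h, direction):
    """Remove the component of h along `direction` (orthogonal projection)."""
    return h - np.outer(h @ direction, direction)

h_ablated = ablate(refusing[:10], est)

# After projection the states are numerically orthogonal to the estimated
# rejection direction, so that feature can no longer drive a refusal.
assert np.max(np.abs(h_ablated @ est)) < 1e-9
```

Because the projection only removes one direction, the rest of the representation is untouched, which is consistent with the reported near-zero benchmark degradation.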

Section 05

Deployment and Tools: Hardware Support, Web Dashboard, and API Services

Hardware and Deployment Support:

  • Hardware Performance: Running Qwen2.5-3B-Instruct on an RTX 4060 Ti, peak memory usage is 3.4 GB and generation speed is 36 tokens/s; dual-GPU sharded deployment is supported;
  • Web Dashboard: Provides functions such as neural terminal, tensor telemetry, generation control, feedback loop, and checkpoint management;
  • API Services: OpenAI-compatible API server that supports streaming/non-streaming generation and feedback endpoints, enabling multi-concurrent inference and continuous learning updates with exclusive GPU access.
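Since the server is OpenAI-compatible, a client talks to it with a standard chat-completions payload. The sketch below only builds the request body; the base URL, model id, and endpoint path are assumptions to check against the project's README.

```python
import json

BASE_URL = "http://localhost:8000/v1"  # hypothetical local server address

def build_chat_request(prompt, stream=True):
    """Build an OpenAI-style /chat/completions request body."""
    return {
        "model": "qwen2.5-3b-instruct-sidecar",  # hypothetical model id
        "messages": [{"role": "user", "content": prompt}],
        # stream=True -> server-sent events; False -> single JSON response
        "stream": stream,
    }

payload = build_chat_request("Explain fiber bundles briefly.", stream=False)
body = json.dumps(payload)
# Any OpenAI-compatible client (e.g. the `openai` Python package pointed at
# BASE_URL) could POST this body to f"{BASE_URL}/chat/completions".
```

Because the wire format is the standard one, existing tooling works unchanged; only the feedback endpoint for continuous-learning updates would be project-specific.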

Section 06

Open Source Ecosystem: License, Pre-trained Resources, and Extensibility

Open Source Ecosystem:

  • Open-sourced under the Apache 2.0 license, including training pipelines, benchmark tools, and unit tests;
  • Pre-trained ablation models and sidecar checkpoints have been uploaded to HuggingFace;
  • Includes the M-A-K-E-R multi-role audit framework for autonomous security analysis of decentralized protocols.

Section 07

Conclusion and Outlook: Significance and Application Value of the Dual-System Architecture

The Dual-System Architecture represents a new paradigm for LLM enhancement: it achieves capability expansion through mathematically rigorous add-on modules without modifying the base model. Its advantages include lower experimental iteration costs and support for multi-tenant deployment, continuous learning, and personalized services. For researchers and developers focused on local AI deployment, model safety, and efficient inference, this project is well worth exploring.