Zing Forum


Dual-System Architecture: A Large Language Model Enhancement Scheme Without Modifying Base Model Weights

This article offers an in-depth analysis of the Dual-System Architecture project—an innovative "geometric sidecar" design that enhances frozen large language models (LLMs) by adding trainable modules. It enables uncensored generation and structured mathematical reasoning while keeping the base model weights completely unchanged, supporting multi-user isolation and continuous learning.

Tags: LLM Sidecar Architecture · Uncensored Generation · Continuous Learning · Multi-User Isolation · KV Cache Compression · Geometric Processor · Fiber Bundle
Published 2026-04-01 07:11 · Recent activity 2026-04-01 07:48 · Estimated read: 7 min

Section 01

Introduction: Dual-System Architecture—A New Paradigm for LLM Enhancement Without Modifying the Base Model

The Dual-System Architecture introduced in this article is an innovative LLM enhancement scheme. Its core lies in adding a "geometric sidecar" module to achieve functions like uncensored generation, structured mathematical reasoning, multi-user isolation, and continuous learning without modifying the base model weights. This architecture treats the frozen base model as "System 1" (fast intuition) and the sidecar module as "System 2" (slow reasoning). It supports independent training iterations, avoids the risk of model degradation caused by traditional fine-tuning, and is compatible with various mainstream LLM architectures (e.g., Qwen2.5-3B, Llama-3.1-8B, etc.).


Section 02

Background: Pain Points of Traditional LLM Enhancement Schemes and the Proposal of Dual-System

Traditional LLM enhancement usually relies on fine-tuning or continued pre-training, approaches that carry high computational cost, are difficult to roll back, and risk damaging the model's original capabilities. The Dual-System Architecture instead proposes a "geometric sidecar" design that enhances a frozen base model by attaching trainable modules, addressing these pain points and offering a new path for expanding LLM capabilities.


Section 03

Methodology: Core Design and Technical Architecture of Dual-System

The core design of the Dual-System Architecture is the "System 1 + System 2" model: System 1 is the frozen base LLM, and System 2 is the geometric sidecar module. The sidecar module includes several key components:

  1. Diffusion Planner: Based on DDIM and adaptive layer normalization, it converts input tokens into high-dimensional latent planning representations;
  2. Geometric Processor: A 4-layer Transformer architecture that performs geometric transformations on latent representations;
  3. Fiber Bundle: A per-user personalization mechanism based on principal fiber bundle theory, ensuring user isolation (cross-user cos_sim ≥ 0.9999);
  4. EBM Judge: Based on an energy-based model, it identifies factual hallucinations and style mismatches;
  5. Cognitive Router: Routes gradients via the Kappa gating mechanism to mitigate catastrophic forgetting in continuous learning.
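To make the System 1 / System 2 split concrete, here is a minimal numpy sketch of the core idea: a frozen base layer plus an additive, trainable sidecar correction. The names (`base_forward`, `sidecar_correction`) and the residual-style update are illustrative assumptions, not the project's actual API.

```python
import numpy as np

D_MODEL = 64
rng = np.random.default_rng(0)

W_base = rng.standard_normal((D_MODEL, D_MODEL))  # base model weights
W_base.setflags(write=False)                      # System 1 stays frozen

def base_forward(h):
    """System 1: the frozen base layer (fast intuition)."""
    return np.tanh(h @ W_base)

def sidecar_correction(h, W_sidecar):
    """System 2: trainable sidecar applied as a residual correction."""
    return h + h @ W_sidecar  # only W_sidecar is ever trained

W_sidecar = np.zeros((D_MODEL, D_MODEL))  # zero-init => identity behavior
h = rng.standard_normal(D_MODEL)
out = sidecar_correction(base_forward(h), W_sidecar)

# With a zero-initialized sidecar the output equals the frozen base output,
# so attaching the sidecar cannot degrade System 1 before any training.
assert np.allclose(out, base_forward(h))
```

The zero-initialized residual form is one common way such adapters guarantee that the untrained module is a no-op, which matches the article's claim that the scheme avoids the degradation risk of fine-tuning.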

Section 04

Evidence: Experimental Validation of Key Capabilities and Performance Optimization

Validation of Key Capabilities:

  • Uncensored Generation: Using the FailSpy differential-mean method to extract and project out the "rejection direction", the rejection rate dropped from about 80% to 0%, while scores on 6 benchmarks (ARC-E, ARC-C, etc.) stayed within 0.3 percentage points of the baseline;
  • Multi-User Isolation: The cos_sim of output changes from per-user style correction was 0.997, and cross-user isolation cos_sim ≥0.9999;
  • Continuous Learning: The EBM Judge performs token-level credit assignment, the Cognitive Router automatically routes gradients, and the BCH integrator merges cross-session perturbations;
  • TurboKV Cache Compression: In 4-bit mode, memory usage was reduced by 3.9x (from 896MB to 232MB for 8K context).
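The differential-mean projection behind the uncensored-generation result can be sketched as follows: estimate the rejection direction as the difference between mean activations on refusal-triggering and benign prompts, then remove that component from each hidden state via orthogonal projection. The activation data below is synthetic; this is a sketch of the technique, not the project's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 128

# Ground-truth direction used only to generate synthetic activations.
refusal_dir = rng.standard_normal(d)
refusal_dir /= np.linalg.norm(refusal_dir)

# Synthetic hidden states: refusal-triggering prompts carry an extra
# component along the refusal direction.
benign = rng.standard_normal((200, d))
refusing = rng.standard_normal((200, d)) + 5.0 * refusal_dir

# Differential-mean estimate of the rejection direction.
est = refusing.mean(axis=0) - benign.mean(axis=0)
est /= np.linalg.norm(est)

def ablate(h, direction):
    """Remove the component of h along `direction` (orthogonal projection)."""
    return h - np.outer(h @ direction, direction)

h_ablated = ablate(refusing[:10], est)

# After projection the states are numerically orthogonal to the estimated
# rejection direction, so that feature can no longer drive a refusal.
assert np.max(np.abs(h_ablated @ est)) < 1e-9
```

Because the projection only removes one direction, the rest of the representation is untouched, which is consistent with the reported near-zero benchmark degradation.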

Section 05

Deployment and Tools: Hardware Support, Web Dashboard, and API Services

Hardware and Deployment Support:

  • Hardware Performance: Running Qwen2.5-3B-Instruct on an RTX 4060 Ti, peak memory usage is 3.4 GB and generation speed is 36 tokens/s; dual-GPU sharded deployment is supported;
  • Web Dashboard: Provides functions such as neural terminal, tensor telemetry, generation control, feedback loop, and checkpoint management;
  • API Services: OpenAI-compatible API server that supports streaming/non-streaming generation and feedback endpoints, enabling multi-concurrent inference and continuous learning updates with exclusive GPU access.
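Since the server is OpenAI-compatible, a client talks to it with a standard chat-completions payload. The sketch below only builds the request body; the base URL, model id, and endpoint path are assumptions to check against the project's README.

```python
import json

BASE_URL = "http://localhost:8000/v1"  # hypothetical local server address

def build_chat_request(prompt, stream=True):
    """Build an OpenAI-style /chat/completions request body."""
    return {
        "model": "qwen2.5-3b-instruct-sidecar",  # hypothetical model id
        "messages": [{"role": "user", "content": prompt}],
        # stream=True -> server-sent events; False -> single JSON response
        "stream": stream,
    }

payload = build_chat_request("Explain fiber bundles briefly.", stream=False)
body = json.dumps(payload)
# Any OpenAI-compatible client (e.g. the `openai` Python package pointed at
# BASE_URL) could POST this body to f"{BASE_URL}/chat/completions".
```

Because the wire format is the standard one, existing tooling works unchanged; only the feedback endpoint for continuous-learning updates would be project-specific.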

Section 06

Open Source Ecosystem: License, Pre-trained Resources, and Extensibility

Open Source Ecosystem:

  • Open-sourced under the Apache 2.0 license, including training pipelines, benchmark tools, and unit tests;
  • Pre-trained ablation models and sidecar checkpoints have been uploaded to HuggingFace;
  • Includes the M-A-K-E-R multi-role audit framework for autonomous security analysis of decentralized protocols.

Section 07

Conclusion and Outlook: Significance and Application Value of the Dual-System Architecture

The Dual-System Architecture represents a new paradigm for LLM enhancement: it achieves capability expansion through mathematically rigorous add-on modules without modifying the base model. Its advantages include lower experimental iteration costs and support for multi-tenant deployment, continuous learning, and personalized services. For researchers and developers focused on local AI deployment, model safety, and efficient inference, this project is well worth exploring.