Zing Forum


KTransformers: A New Paradigm for Large Model Inference and Fine-Tuning via Heterogeneous Computing

The KTransformers framework, jointly launched by Tsinghua MADSys Lab and Approaching.AI, enables running trillion-parameter MoE large models on consumer-grade hardware through a CPU-GPU heterogeneous computing architecture, providing a brand-new solution for edge AI and local deployment.

Tags: KTransformers, heterogeneous computing, MoE large model inference, LLaMA-Factory, edge AI, Tsinghua MADSys, CPU-GPU hybrid, quantized inference, local deployment
Published 2026-04-24 23:45 · Recent activity 2026-04-24 23:50 · Estimated read: 6 min

Section 01

[Overview] KTransformers: Heterogeneous Computing Unlocks New Possibilities for Local Large Model Deployment

The KTransformers framework, jointly launched by the Tsinghua MADSys Lab and Approaching.AI, breaks through the bottleneck of running trillion-parameter MoE large models on consumer-grade hardware via a CPU-GPU heterogeneous computing architecture, providing an efficient solution for edge AI and local deployment. The open-source framework comprises two core modules, kt-kernel (a heterogeneous inference kernel) and kt-sft (a fine-tuning framework). Together they lower the hardware threshold for large model inference and fine-tuning, making KTransformers a notable project in the edge AI field.


Section 02

Background: Hardware Bottlenecks in Large Model Deployment and Optimization Potential of MoE

As the parameter scale of large language models reaches the hundreds-of-billions to trillion range (e.g., the 671B-parameter, MoE-architecture DeepSeek-V3), traditional deployment requires expensive multi-card A100/H100 clusters, which are out of reach for most developers and small-to-medium enterprises. MoE models, however, activate only a subset of their expert networks on each forward pass, leaving large theoretical headroom for computational optimization. How to unlock that potential on consumer-grade hardware has become a key challenge in AI engineering.
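To make that headroom concrete: per DeepSeek-V3's public model card (figures not stated in this article), the model has 671B total parameters but activates only about 37B per token. A quick back-of-envelope calculation:

```python
# Fraction of parameters a MoE model touches per token.
# The 671B total / ~37B active figures are for DeepSeek-V3, taken from
# its public model card, not from this article.
total_params = 671e9
active_params = 37e9

active_fraction = active_params / total_params
print(f"Active per token: {active_fraction:.1%}")  # ≈ 5.5%
```

Only about 5.5% of the weights do work on any given token, which is exactly the slack a CPU-GPU scheduler can exploit.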


Section 03

Core Approach: KTransformers' Heterogeneous Computing Architecture and Two Core Modules

KTransformers adopts a CPU-GPU heterogeneous scheduling strategy: hot experts reside on the GPU to ensure low latency, while cold experts are offloaded to the CPU and accelerated via Intel AMX/AVX512. It dynamically adjusts expert distribution to achieve load balancing. The framework includes two core modules:

  • kt-kernel: Supports mixed quantization (INT4/INT8 on CPU, GPTQ on GPU), and MoE-specific optimizations (NUMA-aware memory management, expert parallelism);
  • kt-sft: Integrated with LLaMA-Factory, it can complete full LoRA fine-tuning of a 671B parameter model with only 70GB GPU memory + 1.3TB RAM, supporting multi-GPU parallelism.
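The hot/cold placement idea above can be sketched as a simple policy: rank experts by how often they are activated and keep the top ones on the GPU. The function and names below are illustrative assumptions, not KTransformers' actual scheduler API.

```python
def place_experts(activation_counts, gpu_slots):
    """Assign the most frequently activated ('hot') experts to the GPU
    and offload the rest ('cold') to the CPU.
    Illustrative sketch only -- not the real KTransformers scheduler."""
    ranked = sorted(activation_counts, key=activation_counts.get, reverse=True)
    hot = set(ranked[:gpu_slots])
    return {e: ("gpu" if e in hot else "cpu") for e in activation_counts}

# Example: 8 experts, GPU has room for 3 of them.
counts = {f"expert_{i}": c for i, c in enumerate([120, 5, 98, 40, 3, 77, 12, 60])}
placement = place_experts(counts, gpu_slots=3)
print(placement["expert_0"], placement["expert_4"])  # gpu cpu
```

A real scheduler would also re-rank periodically as traffic shifts; that is what the dynamic load balancing described above amounts to.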

Section 04

Performance Evidence: Measured Data and Model Support Capabilities

Inference Performance

| Model Configuration | Hardware Environment | Total Throughput | Output Throughput |
| --- | --- | --- | --- |
| DeepSeek-R1-0528 (FP8) | 8×L20 GPU + Xeon Gold 6454S | 227.85 tokens/s | 87.58 tokens/s (8 concurrent) |
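A quick sanity check on the table: dividing the aggregate output throughput by the concurrency gives a rough per-request decode speed (this ignores batching dynamics, so treat it as an estimate):

```python
# Rough per-stream decode speed implied by the table above.
output_tps = 87.58   # aggregate output tokens/s at 8 concurrent requests
concurrency = 8

per_stream = output_tps / concurrency
print(f"{per_stream:.1f} tokens/s per concurrent request")  # ~10.9
```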

Fine-Tuning Performance

| Model | Configuration | Throughput | GPU Memory Usage |
| --- | --- | --- | --- |
| DeepSeek-V3 (671B) | LoRA + AMX | ~40 tokens/s | 70GB (multi-card) |
| DeepSeek-V2-Lite (14B) | LoRA + AMX | ~530 tokens/s | 6GB |
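The 1.3TB RAM figure is consistent with keeping a BF16 copy of all 671B weights in host memory at 2 bytes per parameter. This is one plausible reading of the number, not a statement of kt-sft's actual memory layout:

```python
# Back-of-envelope: host-RAM footprint of 671B parameters stored in BF16.
# Assumes 2 bytes/param; the actual kt-sft layout may differ.
params = 671e9
bf16_bytes = params * 2

print(f"{bf16_bytes / 1e12:.2f} TB")  # ≈ 1.34 TB, in line with the ~1.3TB RAM figure
```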

Day0 Supported Models

Quickly adapts to the latest models, such as Kimi-K2.5, GLM-5, MiniMax-M2.5, and Qwen3-Next, so users can try new releases as soon as they ship.


Section 05

Application Scenarios and Ecosystem: Edge AI, Scientific Research & Teaching, and Cross-Hardware Support

Application Scenarios

  • Edge AI: Process sensitive data locally (medical, financial) without data leaving the domain;
  • Scientific Research & Teaching: Lower the hardware threshold for large model research in universities;
  • Prototype Verification: Quickly verify models locally to shorten the development cycle.

Ecosystem and Hardware Expansion

  • Integrated with the SGLang inference engine to provide production-grade deployment solutions;
  • Supports cross-platform hardware such as NVIDIA GPU, Intel Arc GPU, AMD ROCm, and Huawei Ascend NPU.

Section 06

Conclusion and Recommendations: Heterogeneous Optimization is a Key Direction for Large Model Engineering

KTransformers represents the shift of large model engineering from 'hardware stacking' to 'architecture optimization', demonstrating that consumer-grade hardware can serve trillion-parameter-class models. For AI developers who care about data privacy, cost control, and response speed, KTransformers is a technology stack worth exploring. As demand for edge AI grows, the heterogeneous-optimization approach may become an industry standard.

Project URL: https://github.com/kvcache-ai/ktransformers
Official Documentation: https://kvcache-ai.github.io/ktransformers/