
kernel-set: A High-Performance CUDA Kernel Library for LLM Inference and Training with a Unified C ABI

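kernel-set wraps 78 core LLM operators behind a unified C ABI, can be called from Python, Rust, Go, and TypeScript, and automatically selects the best available kernel implementation, giving large language model inference and training a cross-platform, high-performance compute layer.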

Tags: CUDA, LLM inference, GPU kernels, FlashAttention, GEMM, quantization, multi-language bindings, high-performance computing, Transformer, deep learning

Published: 2026/06/05 20:14 | Last activity: 2026/06/05 20:21 | Estimated reading time: 8 minutes

Section 01

kernel-set: Unified C ABI High-Performance CUDA Kernel Library for LLM Inference & Training

Core Overview

kernel-set is a high-performance CUDA kernel library for LLM inference and training, featuring:

Unified C ABI: Encapsulates 78 core LLM operators, abstracting diverse kernel implementations.
Multi-language Support: Bindings for Python, Rust, Go, and TypeScript give cross-language access.
Auto Optimal Selection: A dispatcher chooses the best kernel based on GPU architecture, data type, and operator type.

Source Info

Author/Maintainer: cklxx
Platform: GitHub
Original Link: https://github.com/cklxx/kernel-set
Update Time: 2026-06-05

Section 02

Background & Motivation

In LLM inference and training, GPU kernel performance directly determines end-to-end efficiency. The strongest existing kernels, however, are spread across separate projects (FlashAttention, vLLM, DeepGEMM, etc.), each with its own interface and its own target scenarios. Developers face two problems:

Selecting the optimal kernel requires a deep understanding of each library.
Hand-written conditional code to switch between them adds complexity and reduces portability.

kernel-set addresses this by unifying 78 core operators under a stable C ABI, enabling one API to access best-in-class implementations.

Section 03

Core Architecture & Key Features

Layered Architecture

Bottom Layer: Mix of self-developed clean-room kernels and third-party optimized implementations.
Middle Layer: Unified C ABI interface (stable, cross-platform).
Top Layer: Multi-language bindings (Python/Rust/Go/TypeScript) for easy integration.

78 Core Operators Coverage

Covers full LLM lifecycle:

Attention: FlashAttention-2, KV cache, MLA, GQA.
GEMM: Tensor core FP16/BF16, fused with bias/activation, quantized formats (W8A8, W4A16, FP8).
Normalization: RMSNorm (fused residual), LayerNorm (forward/backward).
Others: RoPE, SwiGLU, quantization (FP8/INT8/INT4), MoE components, sampling, SSM.
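To make one of the listed operators concrete, here is a minimal pure-Python reference for RMSNorm with a fused residual add, operating on a single row. It is a sketch of the standard formula only, not kernel-set's implementation; the function name, argument order, and default eps are assumptions chosen for illustration.

```python
import math


def rms_norm_fused_residual(x, residual, weight, eps=1e-6):
    """Reference for one row: h = x + residual, y = h / rms(h) * weight.

    Returns (y, h) so the caller can carry h forward as the next residual.
    The name and signature are hypothetical, not taken from kernel-set.
    """
    if not (len(x) == len(residual) == len(weight)):
        raise ValueError("x, residual and weight must have the same length")
    # Fused step: the residual add happens in the same pass as the norm.
    h = [a + b for a, b in zip(x, residual)]
    mean_sq = sum(v * v for v in h) / len(h)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    y = [v * inv_rms * w for v, w in zip(h, weight)]
    return y, h


y, h = rms_norm_fused_residual([1.0, 2.0], [2.0, 2.0], [1.0, 1.0], eps=0.0)
print(h)  # the summed hidden state that becomes the next residual
print(y)  # normalized output; its mean square is 1 when weight is all ones
```

Fusing the residual add matters on a GPU because the operator is memory-bound: a separate add kernel would read and write the hidden state one extra time.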

Smart Kernel Selection

Automatically routes each call to the best available implementation based on GPU architecture (sm70-sm120), data type, and operator.
Transparent to developers: for example, ks.dispatch.rms_norm(x, w) uses the best available kernel.
Fallback: Self-developed clean-room kernels ensure correctness when no optimized option exists.
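The selection rule described above can be sketched as a small registry lookup. This is an illustration of the idea only: the registry entries, kernel names, priority field, and the select_kernel function are all hypothetical and do not reflect kernel-set's actual routing tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelEntry:
    name: str          # hypothetical kernel identifier
    op: str            # operator this kernel implements
    dtypes: frozenset  # data types the kernel accepts
    min_sm: int        # lowest supported compute capability (e.g. 80 for sm80)
    max_sm: int        # highest supported compute capability
    priority: int      # higher wins when several kernels qualify


# Made-up entries: one tuned kernel with narrow coverage, one broad fallback.
REGISTRY = [
    KernelEntry("tuned_rms_norm", "rms_norm", frozenset({"fp16", "bf16"}), 80, 120, 10),
    KernelEntry("cleanroom_rms_norm", "rms_norm", frozenset({"fp32", "fp16", "bf16"}), 70, 120, 0),
]


def select_kernel(op, dtype, sm, registry=REGISTRY):
    """Return the highest-priority kernel matching operator, dtype and GPU."""
    candidates = [
        k for k in registry
        if k.op == op and dtype in k.dtypes and k.min_sm <= sm <= k.max_sm
    ]
    if not candidates:
        raise LookupError(f"no kernel for op={op} dtype={dtype} sm={sm}")
    return max(candidates, key=lambda k: k.priority)


print(select_kernel("rms_norm", "bf16", 90).name)  # tuned kernel qualifies
print(select_kernel("rms_norm", "fp16", 75).name)  # falls back on an older GPU
```

Giving the fallback the widest coverage and the lowest priority means every supported combination resolves to some kernel, while tuned implementations take over wherever they apply.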

Multi-language Support

The precompiled libkernel_set.so exposes the C ABI, so language bindings work without a local CUDA build setup.
Integrates with PyTorch in Python and offers zero-cost abstractions in Rust, among others.
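The sketch below shows what a C-ABI boundary looks like from a binding's point of view: flat pointers, explicit lengths, and an integer status code. Everything here is an assumption for illustration. The symbol name ks_rms_norm_f32 and its signature are invented, and the function body is a Python stand-in wrapped with ctypes.CFUNCTYPE so the example runs without the shared library or a GPU. A real binding would instead load the library with ctypes.CDLL and look up whatever symbols it actually exports.

```python
import ctypes

# Hypothetical C signature, invented for this example:
#   int ks_rms_norm_f32(const float* x, const float* w, float* out,
#                       int n, float eps);
RMS_NORM_FN = ctypes.CFUNCTYPE(
    ctypes.c_int,
    ctypes.POINTER(ctypes.c_float),
    ctypes.POINTER(ctypes.c_float),
    ctypes.POINTER(ctypes.c_float),
    ctypes.c_int,
    ctypes.c_float,
)


def _reference_impl(x, w, out, n, eps):
    """Host-side stand-in for a native kernel; returns 0 on success."""
    if n <= 0:
        return 1
    mean_sq = sum(x[i] * x[i] for i in range(n)) / n
    inv_rms = (mean_sq + eps) ** -0.5
    for i in range(n):
        out[i] = x[i] * inv_rms * w[i]
    return 0


# Wrapping the Python function gives a callable with C calling conventions.
ks_rms_norm_f32 = RMS_NORM_FN(_reference_impl)


def rms_norm(values, weights, eps=1e-6):
    """Thin binding: marshal lists into C arrays, call, check the status."""
    n = len(values)
    FloatArray = ctypes.c_float * n
    x = FloatArray(*values)
    w = FloatArray(*weights)
    out = FloatArray()
    status = ks_rms_norm_f32(x, w, out, n, eps)
    if status != 0:
        raise RuntimeError(f"ks_rms_norm_f32 failed with status {status}")
    return list(out)


print(rms_norm([3.0, 4.0], [1.0, 1.0]))
```

Because the boundary uses only C types, the same declaration can be mirrored in Rust, Go, or TypeScript FFI layers without sharing any language-specific runtime.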

Section 04

Technical Implementation Details

Memory-bound Operators Optimization

Focus on RMSNorm, SwiGLU, RoPE, AdamW (memory bandwidth bottleneck).
Self-developed kernels achieve 84-87% of A100 peak bandwidth (on par with FlashInfer/Liger).
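A figure such as "84-87% of peak bandwidth" is typically derived by dividing the bytes a kernel must move by its measured runtime and comparing the result with the GPU's peak. The sketch below shows that arithmetic for an RMSNorm-shaped workload. The timing is a made-up illustrative value, not a measurement from kernel-set, and the peak of 1555 GB/s is NVIDIA's published figure for the A100 40GB; the source does not say which A100 variant was used.

```python
A100_40GB_PEAK_GBS = 1555.0  # assumption: A100 40GB variant


def rms_norm_bytes(rows, cols, bytes_per_elem):
    """Minimum traffic: read x once, write y once, read the weight vector once."""
    return (2 * rows * cols + cols) * bytes_per_elem


def achieved_bandwidth_gbs(bytes_moved, seconds):
    """Effective bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / seconds / 1e9


def fraction_of_peak(bytes_moved, seconds, peak_gbs):
    return achieved_bandwidth_gbs(bytes_moved, seconds) / peak_gbs


# Illustrative inputs only: an 8192 x 4096 BF16 tensor and a hypothetical
# kernel time of 100 microseconds.
moved = rms_norm_bytes(rows=8192, cols=4096, bytes_per_elem=2)
frac = fraction_of_peak(moved, seconds=100e-6, peak_gbs=A100_40GB_PEAK_GBS)
print(f"{moved} bytes moved, {frac:.1%} of peak")
```

Counting only the minimum required traffic makes the metric conservative: a kernel that rereads data does more work per second than the figure credits it for.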

Compute-bound Operators Strategy

For GEMM, Attention, MoE (compute-intensive), route to industry-leading libs (cuBLAS, FlashAttention, DeepGEMM).
Self-developed kernels act as fallback for portability.
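The split between memory-bound and compute-bound operators can be made quantitative with a roofline-style check: compare an operator's arithmetic intensity (FLOPs per byte moved) with the GPU's ratio of peak compute to peak bandwidth. The sketch below is a generic illustration, not part of kernel-set. The peak figures are NVIDIA's published A100 40GB numbers (312 TFLOPS dense FP16 Tensor Core, 1555 GB/s), and the FLOP count per element for the elementwise case is an assumed round number.

```python
A100_FP16_TENSOR_FLOPS = 312e12  # published dense FP16 Tensor Core peak
A100_40GB_BYTES_PER_S = 1555e9   # published memory bandwidth, 40GB variant


def gemm_intensity(m, n, k, bytes_per_elem):
    """FLOPs per byte for C = A @ B, assuming each matrix is moved once."""
    flops = 2 * m * n * k
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    return flops / bytes_moved


def elementwise_intensity(flops_per_elem, bytes_per_elem, tensors_touched=2):
    """FLOPs per byte for an op that reads and writes each element once."""
    return flops_per_elem / (tensors_touched * bytes_per_elem)


def classify(intensity, peak_flops, peak_bytes_per_s):
    """Compute-bound when intensity reaches the roofline ridge point."""
    ridge = peak_flops / peak_bytes_per_s
    return "compute-bound" if intensity >= ridge else "memory-bound"


big_gemm = gemm_intensity(4096, 4096, 4096, bytes_per_elem=2)
norm_like = elementwise_intensity(flops_per_elem=4, bytes_per_elem=2)

for label, value in [("4096^3 FP16 GEMM", big_gemm), ("norm-like op", norm_like)]:
    kind = classify(value, A100_FP16_TENSOR_FLOPS, A100_40GB_BYTES_PER_S)
    print(f"{label}: {value:.1f} FLOP/byte, {kind}")
```

An operator far below the ridge point gains little from faster math units, which is why the two groups call for different optimization strategies.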

Build System & Hardware Support

Requires CMake 3.24+ and CUDA 12.x.
Supports sm70 (V100) through sm120 (Blackwell); T4 is sm75.
Precompiled wheels cover sm75-sm120 and link the CUDA runtime statically, so no extra compilation is needed.

Section 05

Practical Application & Verification

Validation Methods

examples/eval_model.py hot-swaps kernel-set operators into HuggingFace models and compares the outputs at the bit level.

Key Results

Gemma-2-2B: Bit-identical output across 64 greedy-decoded tokens.
Qwen2.5: 100% top-1 match.
Speedup: Individual operators run 3-9x faster than eager PyTorch.
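The kinds of checks reported above can be sketched with three small helpers: one for bit-level equality of float32 values, one for locating the first token where two greedy decodes diverge, and one for top-1 agreement between two sets of logits. This is a generic illustration with hypothetical function names; it is not the code in examples/eval_model.py.

```python
import struct


def bitwise_equal(a, b):
    """True when two float sequences have identical float32 bit patterns."""
    if len(a) != len(b):
        return False
    pack = lambda values: struct.pack(f"<{len(values)}f", *values)
    return pack(a) == pack(b)


def first_divergence(ref_tokens, test_tokens):
    """Index of the first differing token, or None if the sequences match."""
    for i, (r, t) in enumerate(zip(ref_tokens, test_tokens)):
        if r != t:
            return i
    if len(ref_tokens) != len(test_tokens):
        return min(len(ref_tokens), len(test_tokens))
    return None


def top1_match_rate(ref_logits, test_logits):
    """Fraction of decoding steps where both logit rows pick the same argmax."""
    if len(ref_logits) != len(test_logits) or not ref_logits:
        raise ValueError("need two non-empty logit lists of equal length")
    argmax = lambda row: max(range(len(row)), key=row.__getitem__)
    hits = sum(argmax(r) == argmax(t) for r, t in zip(ref_logits, test_logits))
    return hits / len(ref_logits)


print(bitwise_equal([0.0], [-0.0]))              # sign bit differs
print(first_divergence([5, 7, 9], [5, 8, 9]))    # position of the mismatch
print(top1_match_rate([[0.1, 0.9], [0.8, 0.2]],
                      [[0.2, 0.7], [0.6, 0.4]]))
```

Bit-level equality is the strictest of the three: logits can differ in their low bits while still producing the same top-1 token at every step.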

Tested GPUs

L4 (sm89), A100 (sm80), RTX PRO 6000 Blackwell (sm120).

Section 06

Ecosystem Integration & Toolchain

ksctl Command-line Tool

Generates a kernel configuration for a given model, GPU, and data type, for example: python3 models/ksctl plan --model deepseek-v3 --gpu h100 --dtype fp8

Model Kernel Mappings

Maps 157 mainstream models (including DeepSeek-V4, GLM-5, Kimi-2.6, Gemma-4, and Llama4) to the kernels each one requires.

Documentation

Detailed guides: optimal selection mechanism, routing tables, quantization.
Full catalog: 127 logical operators, 476 atomic operators.

Section 07

Open Source License & Contribution

License

kernel-set is released under the Apache-2.0 license (commercial-friendly).
Self-developed kernels: clean-room implementation.
Third-party code: retains original licenses (see THIRD_PARTY_NOTICES.md).

Contribution

Open to community contributions while respecting upstream intellectual property.

Section 08

Summary & Future Outlook

Core Value

kernel-set provides:

Simplification: One API for 78 operators (no need to learn multiple libraries).
Performance: Auto optimal selection; memory-bound operators reach 84-87% peak bandwidth.
Portability: Runs on GPUs from sm70 (V100) to sm120 (Blackwell).
Multi-language: Python/Rust/Go/TypeScript support.

Outlook

As LLMs scale and hardware evolves, unified abstraction layers like kernel-set will become critical, letting developers focus on models and applications instead of kernel optimization.