# kernel-set: Unified C ABI High-Performance CUDA Kernel Library for LLM Inference & Training

> kernel-set wraps 78 core LLM operators behind a unified C ABI, provides bindings for Python, Rust, Go, and TypeScript, automatically selects the best available kernel implementation, and offers a portable high-performance computing layer for large language model inference and training.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-05T12:14:49.000Z
- Last activity: 2026-06-05T12:21:24.635Z
- Popularity: 163.9
- Keywords: CUDA, LLM inference, GPU kernels, FlashAttention, GEMM, quantization, multi-language bindings, high-performance computing, Transformer, deep learning
- Page link: https://www.zingnex.cn/en/forum/thread/kernel-set-c-abillmcuda
- Canonical: https://www.zingnex.cn/forum/thread/kernel-set-c-abillmcuda
- Markdown source: floors_fallback

---

## kernel-set: Unified C ABI High-Performance CUDA Kernel Library for LLM Inference & Training (Main Thread)

### Core Overview
kernel-set is a high-performance CUDA kernel library for LLM inference and training. Its main features are:
- **Unified C ABI**: Wraps 78 core LLM operators and hides the differences between the underlying kernel implementations.
- **Multi-language Support**: Provides bindings for Python, Rust, Go, and TypeScript.
- **Automatic Kernel Selection**: A dispatcher picks the best available kernel based on GPU architecture, data type, and operator type.

### Source Info
- Author/Maintainer: cklxx
- Platform: GitHub
- Original Link: https://github.com/cklxx/kernel-set
- Update Time: 2026-06-05

## Background & Motivation

In LLM inference and training, GPU kernel performance directly determines efficiency. The strongest existing kernels (FlashAttention, vLLM, DeepGEMM, and others) each expose their own interface and are each tuned for different scenarios. As a result, developers face two problems:
1. They need a deep understanding of every library to choose the best kernel for a given case.
2. They must write conditional selection code by hand, which adds complexity and reduces portability.

kernel-set addresses this by unifying 78 core operators under a stable C ABI, so that a single API gives access to best-in-class implementations.

## Core Architecture & Key Features

#### Layered Architecture
- **Bottom Layer**: Mix of self-developed clean-room kernels and third-party optimized implementations.
- **Middle Layer**: Unified C ABI interface (stable, cross-platform).
- **Top Layer**: Multi-language bindings (Python/Rust/Go/TypeScript) for easy integration.

#### Coverage of 78 Core Operators
The operators cover the full LLM lifecycle:
- Attention: FlashAttention-2, KV cache, MLA, GQA.
- GEMM: Tensor core FP16/BF16, fused with bias/activation, quantized formats (W8A8, W4A16, FP8).
- Normalization: RMSNorm (fused residual), LayerNorm (forward/backward).
- Others: RoPE, SwiGLU, quantization (FP8/INT8/INT4), MoE components, sampling, SSM.
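As a point of reference for the operator list above, here is what "RMSNorm (fused residual)" computes. This is a plain-Python sketch of the standard RMSNorm formula, not kernel-set's CUDA code; the function names and the `eps` default are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple


def rms_norm(x: Sequence[float], weight: Sequence[float], eps: float = 1e-6) -> List[float]:
    """y_i = x_i / sqrt(mean(x^2) + eps) * weight_i, over one hidden vector."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def fused_add_rms_norm(
    x: Sequence[float],
    residual: Sequence[float],
    weight: Sequence[float],
    eps: float = 1e-6,
) -> Tuple[List[float], List[float]]:
    """Add the residual first, then normalize the sum.

    Returns (normalized output, updated residual) so the sum can feed
    the next layer without being recomputed.
    """
    hidden = [a + b for a, b in zip(x, residual)]
    return rms_norm(hidden, weight, eps), hidden
```

Fusing the two steps means the sum is produced and normalized in one pass over the data, which matters for an operator whose cost is dominated by memory traffic.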

#### Smart Kernel Selection
- Calls are routed automatically to the best implementation for the current GPU (sm70-sm120), data type, and operator.
- Selection is transparent to developers: for example, `ks.dispatch.rms_norm(x,w)` runs the best available kernel.
- Fallback: when no specialized implementation applies, the self-developed clean-room kernels are used so that results stay correct.
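The selection rules above can be pictured as a small routing table. The sketch below is an assumption about how such a dispatcher might be organized, not kernel-set's actual code: the `Candidate` type, the `REGISTRY` contents, the kernel names, and the priority scheme are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Candidate:
    name: str                 # hypothetical kernel identifier
    min_sm: int               # lowest supported compute capability (e.g. 80 for sm80)
    max_sm: int               # highest supported compute capability
    dtypes: Tuple[str, ...]   # data types this kernel handles
    priority: int             # higher wins when several candidates match


# Hypothetical routing table: one specialized kernel plus a clean-room fallback.
REGISTRY: Dict[str, List[Candidate]] = {
    "rms_norm": [
        Candidate("third_party_fused", 80, 120, ("fp16", "bf16"), priority=10),
        Candidate("clean_room", 70, 120, ("fp16", "bf16", "fp32"), priority=0),
    ],
}


def select_kernel(op: str, sm: int, dtype: str) -> str:
    """Return the highest-priority kernel that supports this GPU and dtype."""
    matches = [
        c for c in REGISTRY.get(op, [])
        if c.min_sm <= sm <= c.max_sm and dtype in c.dtypes
    ]
    if not matches:
        raise LookupError(f"no kernel for op={op!r}, sm={sm}, dtype={dtype!r}")
    return max(matches, key=lambda c: c.priority).name
```

In this sketch the clean-room entry has the widest coverage and the lowest priority, so it is chosen only when no specialized kernel matches, which mirrors the fallback behavior described above.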

#### Multi-language Support
- A precompiled `libkernel_set.so` exposes the C ABI, so the language bindings work without a local CUDA setup.
- The Python binding integrates with PyTorch, the Rust binding is a zero-cost abstraction, and the other languages follow the same pattern.
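To illustrate why a C ABI makes bindings cheap to write, here is a minimal sketch of loading a shared library from Python with the standard `ctypes` module. Only the file name `libkernel_set.so` comes from the text above; the symbol name `ks_rms_norm` and its argument list are hypothetical placeholders, and the project's real exports may look quite different.

```python
import ctypes
from typing import Optional, Sequence


def load_kernel_set(path: str = "libkernel_set.so") -> Optional[ctypes.CDLL]:
    """Try to load the shared library; return None if it cannot be found."""
    try:
        return ctypes.CDLL(path)
    except OSError:
        return None


def bind_symbol(lib: ctypes.CDLL, name: str, argtypes: Sequence, restype=ctypes.c_int):
    """Attach a C signature to an exported symbol so ctypes can check arguments."""
    fn = getattr(lib, name)  # raises AttributeError if the symbol is not exported
    fn.argtypes = list(argtypes)
    fn.restype = restype
    return fn


if __name__ == "__main__":
    lib = load_kernel_set()
    if lib is None:
        print("libkernel_set.so not found; nothing to call")
    else:
        try:
            # HYPOTHETICAL symbol and signature, for illustration only:
            # (out, x, weight, rows, hidden, eps, stream)
            rms_norm = bind_symbol(
                lib,
                "ks_rms_norm",
                [ctypes.c_void_p, ctypes.c_void_p, ctypes.c_void_p,
                 ctypes.c_int64, ctypes.c_int64, ctypes.c_float, ctypes.c_void_p],
            )
            print("bound", rms_norm)
        except AttributeError:
            print("symbol not exported under this (assumed) name")
```

The same two steps, load the library and declare a signature, are what an FFI layer in Rust, Go, or TypeScript would also do, which is the practical benefit of keeping the middle layer a plain C ABI.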

## Technical Implementation Details

#### Memory-bound Operator Optimization
- The self-developed work focuses on RMSNorm, SwiGLU, RoPE, and AdamW, all of which are limited by memory bandwidth rather than compute.
- The self-developed kernels reach **84-87% of A100 peak bandwidth**, on par with FlashInfer and Liger.
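A "percent of peak bandwidth" figure is derived from the bytes a kernel must move and the time it takes. The sketch below shows that arithmetic for RMSNorm under a simple traffic model (read the input and weight once, write the output once). The tensor shape, the timing, and the 1555 GB/s peak are assumed inputs for illustration, not measurements from the project, and the text does not say which A100 variant was used.

```python
def rms_norm_bytes(rows: int, hidden: int, dtype_bytes: int = 2) -> int:
    """Minimum memory traffic: read x (rows*hidden) and weight (hidden), write y (rows*hidden)."""
    return (2 * rows * hidden + hidden) * dtype_bytes


def achieved_gb_per_s(bytes_moved: int, seconds: float) -> float:
    """Effective bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / seconds / 1e9


def fraction_of_peak(achieved_gbs: float, peak_gbs: float) -> float:
    """Ratio of achieved bandwidth to the hardware's peak bandwidth."""
    return achieved_gbs / peak_gbs


if __name__ == "__main__":
    # ASSUMED inputs: an fp16 tensor of 8192 x 4096, a made-up kernel time,
    # and an assumed peak bandwidth. Replace all three with real values.
    traffic = rms_norm_bytes(rows=8192, hidden=4096, dtype_bytes=2)
    elapsed_s = 100e-6
    peak_gbs = 1555.0
    bw = achieved_gb_per_s(traffic, elapsed_s)
    print(f"{bw:.0f} GB/s, {fraction_of_peak(bw, peak_gbs):.1%} of assumed peak")
```

Because the traffic model counts only unavoidable reads and writes, a kernel that re-reads its input or spills intermediates will report a lower fraction for the same wall-clock time.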

#### Compute-bound Operators Strategy
- Compute-intensive operators (GEMM, Attention, MoE) are routed to industry-leading libraries such as cuBLAS, FlashAttention, and DeepGEMM.
- The self-developed kernels serve as a fallback for portability.

#### Build System & Hardware Support
- Requires CMake 3.24+ and CUDA 12.x.
- Supports sm70 (V100) and sm75 (T4) through sm120 (Blackwell).
- Precompiled wheels cover sm75-sm120 and link the CUDA runtime statically, so no extra compilation is needed.

## Practical Application & Verification

#### Validation Methods
- `examples/eval_model.py` swaps kernel-set operators into HuggingFace models at runtime and compares the outputs at the bit level.

#### Key Results
- **Gemma-2-2B**: Output is bit-identical across 64 greedy-decoded tokens.
- **Qwen2.5**: 100% top-1 match.
- **Speedup**: Individual operators run 3-9x faster than eager PyTorch.
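The two correctness criteria above, bit-identical output and top-1 match, are different checks. The sketch below shows one way to express each in plain Python; it is an independent illustration, not the contents of `examples/eval_model.py`, and all function names are hypothetical.

```python
import struct
from typing import Sequence


def same_bits_f32(a: float, b: float) -> bool:
    """True only if a and b have identical float32 bit patterns."""
    return struct.pack("<f", a) == struct.pack("<f", b)


def first_mismatch(ref: Sequence[int], test: Sequence[int]) -> int:
    """Index of the first differing token, or -1 if the sequences are identical."""
    for i, (r, t) in enumerate(zip(ref, test)):
        if r != t:
            return i
    # Common prefix matches: equal lengths mean identical, otherwise the
    # mismatch is where the shorter sequence ends.
    return -1 if len(ref) == len(test) else min(len(ref), len(test))


def top1_agreement(
    ref_logits: Sequence[Sequence[float]],
    test_logits: Sequence[Sequence[float]],
) -> float:
    """Fraction of positions where both logit rows pick the same argmax token."""
    pairs = list(zip(ref_logits, test_logits))
    if not pairs:
        raise ValueError("need at least one position to compare")

    def argmax(row: Sequence[float]) -> int:
        return max(range(len(row)), key=row.__getitem__)

    return sum(argmax(r) == argmax(t) for r, t in pairs) / len(pairs)
```

Bit-level equality is the stricter test: `same_bits_f32(0.0, -0.0)` is false even though the two values compare equal numerically, whereas top-1 agreement tolerates small numeric drift as long as the winning token does not change.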

#### Tested GPUs
- L4 (sm89), A100 (sm80), RTX PRO 6000 Blackwell (sm120).

## Ecosystem Integration & Toolchain

#### ksctl Command-line Tool
- Generate optimal kernel config: `python3 models/ksctl plan --model deepseek-v3 --gpu h100 --dtype fp8`.

#### Model Kernel Mappings
- Maps **157 mainstream models** (including DeepSeek-V4, GLM-5, Kimi-2.6, Gemma-4, and Llama4) to the kernels each one requires.

#### Documentation
- Detailed guides: optimal selection mechanism, routing tables, quantization.
- Full catalog: 127 logical operators, 476 atomic operators.

## Open Source License & Contribution

#### License
- kernel-set is released under the **Apache-2.0** license (commercial-friendly).
- Self-developed kernels: clean-room implementation.
- Third-party code: retains original licenses (see `THIRD_PARTY_NOTICES.md`).

#### Contribution
Open to community contributions while respecting upstream intellectual property.

## Summary & Future Outlook

#### Core Value
kernel-set provides:
1. **Simplification**: One API for 78 operators, with no need to learn multiple libraries.
2. **Performance**: Automatic kernel selection; memory-bound operators reach 84-87% of A100 peak bandwidth.
3. **Portability**: Runs on GPUs from sm70 (V100) through sm120 (Blackwell).
4. **Multi-language**: Bindings for Python, Rust, Go, and TypeScript.

#### Outlook
As LLMs scale and hardware evolves, unified abstraction layers like kernel-set will become critical, letting developers focus on models and applications instead of kernel optimization.
