Zing Forum

kernel-set: A High-Performance CUDA Kernel Library for LLM Inference and Training with a Unified C ABI

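kernel-set wraps 78 core LLM operators behind a unified C ABI, supports calls from Python, Rust, Go, and TypeScript, and automatically selects the best kernel implementation, providing a cross-platform, high-performance compute layer for large language model inference and training.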

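Tags: CUDA, LLM inference, GPU kernels, FlashAttention, GEMM, quantization, multi-language bindings, high-performance computing, Transformer, deep learning
Published 2026/06/05 20:14 · Last activity 2026/06/05 20:21 · Estimated reading time: 8 minutes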

Section 01

kernel-set: Unified C ABI High-Performance CUDA Kernel Library for LLM Inference & Training (Main Thread)

Core Overview

kernel-set is a high-performance CUDA kernel library for LLM inference and training. Its main features:

  • Unified C ABI: Wraps 78 core LLM operators behind one interface, hiding the differences between kernel implementations.
  • Multi-language support: Bindings for Python, Rust, Go, and TypeScript.
  • Automatic kernel selection: A dispatcher picks the best kernel for the GPU architecture, data type, and operator.

Section 02

Background & Motivation

In LLM inference and training, GPU kernel performance directly determines efficiency. However, the best existing kernels (FlashAttention, vLLM, DeepGEMM, etc.) expose fragmented interfaces and are each tuned for different scenarios. Developers therefore face two problems:

  1. Choosing the optimal kernel requires a deep understanding of every library.
  2. Hand-written conditional code to switch between libraries adds complexity and hurts portability.

kernel-set addresses this by unifying 78 core operators under a stable C ABI, so that a single API reaches best-in-class implementations.

Section 03

Core Architecture & Key Features

Layered Architecture

  • Bottom Layer: Mix of self-developed clean-room kernels and third-party optimized implementations.
  • Middle Layer: Unified C ABI interface (stable, cross-platform).
  • Top Layer: Multi-language bindings (Python/Rust/Go/TypeScript) for easy integration.

78 Core Operators Coverage

Covers the full LLM lifecycle:

  • Attention: FlashAttention-2, KV cache, MLA, GQA.
  • GEMM: Tensor core FP16/BF16, fused with bias/activation, quantized formats (W8A8, W4A16, FP8).
  • Normalization: RMSNorm (fused residual), LayerNorm (forward/backward).
  • Others: RoPE, SwiGLU, quantization (FP8/INT8/INT4), MoE components, sampling, SSM.
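
For readers unfamiliar with the "RMSNorm (fused residual)" entry, here is a pure-Python reference of what such an operator commonly computes. This is a sketch under assumptions: the function name, the eps default, and the choice to return the updated residual are illustrative and not taken from kernel-set's API.

```python
import math

def rms_norm_fused_residual(x, residual, weight, eps=1e-6):
    """Reference semantics only (hypothetical name, not kernel-set's API).

    h = x + residual
    y = h / sqrt(mean(h^2) + eps) * weight
    Returns (y, h) so the next layer can reuse h as its residual.
    """
    # Step 1: residual add (a fused kernel does this without a second pass over memory).
    h = [xi + ri for xi, ri in zip(x, residual)]
    # Step 2: root-mean-square over the hidden dimension.
    mean_sq = sum(v * v for v in h) / len(h)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    # Step 3: normalize and apply the learned per-channel weight.
    y = [v * inv_rms * w for v, w in zip(h, weight)]
    return y, h
```

A fused kernel saves bandwidth because the sum h is produced and normalized in one pass instead of being written out and read back.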

Smart Kernel Selection

  • Automatically routes each call to the best implementation for the GPU architecture (sm70-sm120), data type, and operator.
  • Transparent to developers: for example, ks.dispatch.rms_norm(x, w) uses the best available kernel.
  • Fallback: self-developed clean-room kernels ensure correctness when no optimized implementation is available.
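
The selection rules above can be sketched as a small routing table. This is a minimal illustration, not kernel-set's actual dispatcher: the Candidate type, the ROUTES table, the backend names, and the priority scheme are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str           # hypothetical backend label
    min_sm: int         # lowest supported compute capability, e.g. 80 for sm80
    max_sm: int         # highest supported compute capability
    dtypes: frozenset   # data types this kernel handles
    priority: int       # higher wins when several candidates match

# Hypothetical routing table: one optimized kernel plus a clean-room fallback.
ROUTES = {
    "rms_norm": [
        Candidate("optimized_rms_norm", 80, 120, frozenset({"fp16", "bf16"}), 10),
        Candidate("cleanroom_rms_norm", 70, 120, frozenset({"fp16", "bf16", "fp32"}), 0),
    ],
}

def select_kernel(op: str, sm: int, dtype: str) -> str:
    """Return the highest-priority kernel matching (operator, GPU arch, dtype)."""
    matches = [
        c for c in ROUTES.get(op, [])
        if c.min_sm <= sm <= c.max_sm and dtype in c.dtypes
    ]
    if not matches:
        raise LookupError(f"no kernel for op={op} sm={sm} dtype={dtype}")
    return max(matches, key=lambda c: c.priority).name
```

Giving the fallback the widest architecture and dtype coverage at the lowest priority is what lets a call succeed everywhere while still preferring the faster kernel where one applies.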

Multi-language Support

  • The precompiled libkernel_set.so exposes the C ABI, so bindings work without setting up a CUDA toolchain.
  • Integrates with PyTorch on the Python side and offers zero-cost abstractions in Rust, among others.
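
To show what binding to a C ABI looks like from Python, here is a sketch using ctypes. The C signature is an assumption made for illustration (the post does not list kernel-set's exported symbols), and a Python callback stands in for the shared library so the example runs without a GPU. A real kernel would take device pointers rather than host arrays.

```python
import ctypes
import math

# Hypothetical C signature, NOT kernel-set's documented ABI:
#   int ks_rms_norm_f32(const float* x, const float* w, float* out,
#                       size_t n, float eps);
FloatPtr = ctypes.POINTER(ctypes.c_float)
RmsNormFn = ctypes.CFUNCTYPE(
    ctypes.c_int, FloatPtr, FloatPtr, FloatPtr, ctypes.c_size_t, ctypes.c_float
)

def _host_rms_norm(x, w, out, n, eps):
    """Stand-in for the library symbol: same calling convention, host memory."""
    inv_rms = 1.0 / math.sqrt(sum(x[i] * x[i] for i in range(n)) / n + eps)
    for i in range(n):
        out[i] = x[i] * inv_rms * w[i]
    return 0  # status code: 0 means success in this sketch

# With a real library this would come from ctypes.CDLL(...) instead.
ks_rms_norm_f32 = RmsNormFn(_host_rms_norm)

def rms_norm(values, weights, eps=1e-6):
    """Thin Python wrapper: marshal lists to C arrays, call, check status."""
    n = len(values)
    Array = ctypes.c_float * n
    x, w, out = Array(*values), Array(*weights), Array()
    status = ks_rms_norm_f32(x, w, out, n, eps)
    if status != 0:
        raise RuntimeError(f"kernel call failed with status {status}")
    return list(out)
```

The same prototype-plus-wrapper shape carries over to Rust, Go, or TypeScript FFI, which is the practical benefit of keeping the middle layer a plain C ABI.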

Section 04

Technical Implementation Details

Memory-bound Operators Optimization

  • Focuses on RMSNorm, SwiGLU, RoPE, and AdamW, which are bottlenecked by memory bandwidth rather than compute.
  • Self-developed kernels achieve 84-87% of the A100's peak memory bandwidth (on par with FlashInfer/Liger).
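
For context on how a "percent of peak bandwidth" figure is usually derived, the arithmetic can be sketched as follows. The assumptions are mine: the post does not say which A100 variant was measured, so the 40 GB model's 1555 GB/s peak is used here, and the 100 microsecond timing is an illustrative number, not a measurement from kernel-set.

```python
# Assumed peak: NVIDIA's published memory bandwidth for the A100 40 GB.
A100_40GB_PEAK_GBPS = 1555.0

def rms_norm_bytes(rows, hidden, dtype_bytes):
    """Minimum traffic for RMSNorm: read x, write y, read the weight once."""
    return (2 * rows * hidden + hidden) * dtype_bytes

def bandwidth_fraction(bytes_moved, seconds, peak_gbps):
    """Achieved bandwidth as a fraction of the hardware peak."""
    achieved_gbps = bytes_moved / seconds / 1e9
    return achieved_gbps / peak_gbps

# Illustrative case: 8192 x 4096 bf16 activations, hypothetical 100 us runtime.
moved = rms_norm_bytes(rows=8192, hidden=4096, dtype_bytes=2)
fraction = bandwidth_fraction(moved, seconds=100e-6, peak_gbps=A100_40GB_PEAK_GBPS)
print(f"{fraction:.1%} of peak")
```

Because these operators do very little arithmetic per byte, this fraction is the meaningful efficiency metric for them. FLOP counts say little here.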

Compute-bound Operators Strategy

  • Compute-intensive operators (GEMM, Attention, MoE) are routed to industry-leading libraries (cuBLAS, FlashAttention, DeepGEMM).
  • Self-developed kernels act as fallback for portability.

Build System & Hardware Support

  • Requires CMake 3.24+ and CUDA 12.x.
  • Supports sm70 (V100) and sm75 (T4) through sm120 (Blackwell).
  • Precompiled wheels (sm75-sm120) link the CUDA runtime statically, so no extra compilation is needed.
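
Reading the two ranges above literally, an sm70 GPU is supported but falls outside the precompiled wheels. A small helper makes that distinction explicit. The function name and return labels are hypothetical, and that sm70 requires a source build is my inference from the stated ranges, not something the post says outright.

```python
# Ranges as stated in the post (compute capability written as major*10 + minor).
SUPPORTED_SM = (70, 120)   # overall hardware support
WHEEL_SM = (75, 120)       # covered by precompiled wheels

def install_path(sm: int) -> str:
    """Classify a GPU by how kernel-set would presumably be installed on it."""
    if WHEEL_SM[0] <= sm <= WHEEL_SM[1]:
        return "wheel"        # precompiled, static CUDA runtime
    if SUPPORTED_SM[0] <= sm <= SUPPORTED_SM[1]:
        return "source"       # inferred: build with CMake 3.24+ and CUDA 12.x
    return "unsupported"
```

With PyTorch installed, torch.cuda.get_device_capability() returns a (major, minor) tuple that can be turned into the sm value as major * 10 + minor.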

Section 05

Practical Application & Verification

Validation Methods

  • examples/eval_model.py swaps kernel-set operators into HuggingFace models at runtime for bit-level comparison.

Key Results

  • Gemma-2-2B: bit-identical output across 64 greedy-decoded tokens.
  • Qwen2.5: 100% top-1 agreement.
  • Speedup: individual operators run 3-9x faster than eager PyTorch.
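
The two correctness metrics reported above (bit-level consistency and top-1 agreement) can be computed along these lines. This is a generic sketch, not the logic of examples/eval_model.py, which the post does not show. Both function names are hypothetical.

```python
import struct

def bit_identical(a, b):
    """True if two float sequences match bit for bit as float32 values.

    Stricter than ==: for example 0.0 and -0.0 compare equal numerically
    but have different bit patterns.
    """
    return len(a) == len(b) and all(
        struct.pack("<f", x) == struct.pack("<f", y) for x, y in zip(a, b)
    )

def top1_agreement(ref_logits, test_logits):
    """Fraction of positions where both models pick the same argmax token."""
    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)
    matches = sum(argmax(r) == argmax(t) for r, t in zip(ref_logits, test_logits))
    return matches / len(ref_logits)
```

Top-1 agreement tolerates small numerical drift as long as the chosen token is unchanged, while the bit-level check tolerates none, which is why the two are reported separately.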

Tested GPUs

  • L4 (sm89), A100 (sm80), RTX PRO 6000 Blackwell (sm120).

Section 06

Ecosystem Integration & Toolchain

ksctl Command-line Tool

  • Generates an optimal kernel configuration, for example: python3 models/ksctl plan --model deepseek-v3 --gpu h100 --dtype fp8

Model Kernel Mappings

  • Maintains mappings from 157 mainstream models (including DeepSeek-V4, GLM-5, Kimi-2.6, Gemma-4, and Llama4) to the kernels each one requires.

Documentation

  • Detailed guides: optimal selection mechanism, routing tables, quantization.
  • Full catalog: 127 logical operators, 476 atomic operators.

Section 07

Open Source License & Contribution

License

  • kernel-set is released under the Apache-2.0 license (commercial-friendly).
  • Self-developed kernels: clean-room implementation.
  • Third-party code: retains original licenses (see THIRD_PARTY_NOTICES.md).

Contribution

Open to community contributions while respecting upstream intellectual property.

Section 08

Summary & Future Outlook

Core Value

kernel-set provides:

  1. Simplification: One API for 78 operators, with no need to learn multiple libraries.
  2. Performance: Automatic kernel selection; memory-bound operators reach 84-87% of the A100's peak memory bandwidth.
  3. Portability: Runs on GPUs from sm70 (V100) through sm120 (Blackwell).
  4. Multi-language: Python, Rust, Go, and TypeScript support.

Outlook

As LLMs scale and hardware evolves, unified abstraction layers like kernel-set will become critical, letting developers focus on models and applications instead of kernel optimization.