Zing Forum

XL-Persistent-Kernel: Exploration of Persistent GPU Kernel Architecture for Ultra-Low Latency LLM Inference

This article introduces the XL-Persistent-Kernel project, a research framework exploring the persistent GPU megakernel execution model. It aims to integrate stages like prefill, decoding, and speculative verification in LLM inference services into a single GPU-resident execution loop, thereby significantly reducing CPU scheduling overhead and kernel launch latency.

Tags: LLM inference, GPU optimization, persistent kernel, CUDA, speculative decoding, KV cache, low latency, LLM serving, Mega-Kernel
Published 2026-06-11 02:40 · Recent activity 2026-06-11 02:49 · Estimated read 10 min

Section 01

[Introduction] XL-Persistent-Kernel: Exploring Persistent GPU Kernel Architecture to Reduce LLM Inference Latency

Core Idea: This project explores a persistent GPU megakernel execution model that folds the prefill, decoding, and speculative verification stages of LLM inference into a single GPU-resident loop, with the aim of reducing CPU scheduling overhead and kernel launch latency.

Section 02

Project Background and Motivation

As LLMs scale toward the trillion-parameter level, traditional inference serving architectures hit a performance bottleneck. Under CPU-driven scheduling, every generated token requires the CPU to launch GPU kernels, and these frequent round trips accumulate synchronization overhead and latency. XL-Persistent-Kernel explores the persistent GPU megakernel paradigm instead: it moves the inference control flow onto the GPU so that the GPU itself manages the request lifecycle, scheduling decisions, and memory operations, with the goal of removing the kernel launch overhead and CPU-GPU synchronization bottlenecks of the traditional architecture.

Section 03

Architecture Design and Core Advantages

Architecture Design Overview

The design models prefill, decoding, speculative verification, commit, and KV cache lifecycle management as logical stages inside a single persistent GPU kernel, rather than as independent kernel launches.

Request Lifecycle Flow

  1. Request submission
  2. Prefill worker builds the initial KV cache
  3. KV page planner allocates physical pages
  4. Decoding worker runs the decoding loop
  5. Speculative proposer generates candidate token blocks
  6. Validator verifies the candidates
  7. Accepted tokens are committed and rejected drafts are released
  8. Request completion (EOS, budget exhausted, or target matched)
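
The flow above can be read as a small state machine. The sketch below is illustrative only: the `Phase` names, the `TRANSITIONS` table, and the assumption that steps 4 to 7 repeat once per speculative round are not taken from the project's code.

```python
from enum import Enum, auto


class Phase(Enum):
    """Hypothetical phase names mirroring the eight lifecycle steps."""
    SUBMITTED = auto()
    PREFILL = auto()
    KV_PLANNED = auto()
    DECODE = auto()
    PROPOSE = auto()
    VERIFY = auto()
    COMMIT = auto()
    DONE = auto()


# Allowed transitions. Assumption: after a commit the request either
# starts another decode/propose/verify round or completes.
TRANSITIONS = {
    Phase.SUBMITTED: {Phase.PREFILL},
    Phase.PREFILL: {Phase.KV_PLANNED},
    Phase.KV_PLANNED: {Phase.DECODE},
    Phase.DECODE: {Phase.PROPOSE},
    Phase.PROPOSE: {Phase.VERIFY},
    Phase.VERIFY: {Phase.COMMIT},
    Phase.COMMIT: {Phase.DECODE, Phase.DONE},
    Phase.DONE: set(),
}


def advance(current: Phase, nxt: Phase) -> Phase:
    """Move to the next phase, rejecting transitions the flow does not allow."""
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {nxt.name}")
    return nxt


def run_request(rounds: int) -> list[str]:
    """Walk one request through the flow with `rounds` speculative rounds."""
    if rounds < 1:
        raise ValueError("at least one decode round is required")
    phase = Phase.SUBMITTED
    trace = [phase.name]
    for nxt in (Phase.PREFILL, Phase.KV_PLANNED):
        phase = advance(phase, nxt)
        trace.append(phase.name)
    for _ in range(rounds):
        for nxt in (Phase.DECODE, Phase.PROPOSE, Phase.VERIFY, Phase.COMMIT):
            phase = advance(phase, nxt)
            trace.append(phase.name)
    phase = advance(phase, Phase.DONE)
    trace.append(phase.name)
    return trace
```

Calling `run_request(2)` returns the ordered phase names for a request that goes through two speculative rounds before completing. In a real megakernel these transitions would be taken on the device rather than in host Python.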

Megakernel Design Philosophy and Advantages

Philosophy: The inference serving pipeline should be a single megakernel resident on the GPU, rather than a long chain of kernels launched by the CPU. Advantages: fewer repeated kernel launches, no per-step CPU scheduling overhead, minimal CPU-GPU synchronization, less fragmented GPU execution, and a KV cache that stays GPU-resident.

Section 04

Technical Implementation Details

Current Implementation Status

The project currently provides a Python runtime simulator with the following core components:

  • Runtime simulator (prefill/decoding workers)
  • Speculative block proposal and verification (configurable acceptance strategy)
  • Paged KV cache planner (LRU eviction, page locking, etc.)
  • Backend interface (abstract kernel + CPU stub)
  • Benchmark framework (exports metrics like TTFT, ITL)
  • CUDA stub layer (xl_persistent_megakernel and baseline kernels)
  • CI pipeline (pytest, ruff, mypy)
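
To make the paged KV cache planner item more concrete, here is a minimal sketch of LRU eviction combined with page locking. The class name `PagedKVPlanner`, its methods, and its counters are hypothetical and are not the project's API. The sketch tracks only page metadata, in line with the simulator's stub nature.

```python
from collections import OrderedDict


class PagedKVPlanner:
    """Toy planner: fixed page budget, LRU eviction, lockable pages."""

    def __init__(self, num_pages: int) -> None:
        self.free = list(range(num_pages))  # unused physical page ids
        self.pages = OrderedDict()          # (request, logical page) -> physical page, oldest first
        self.locked = set()                 # keys that must not be evicted
        self.hits = 0
        self.misses = 0
        self.evictions = 0

    def lock(self, req: str, logical: int) -> None:
        self.locked.add((req, logical))

    def unlock(self, req: str, logical: int) -> None:
        self.locked.discard((req, logical))

    def touch(self, req: str, logical: int) -> int:
        """Return the physical page for a logical page, allocating on a miss."""
        key = (req, logical)
        if key in self.pages:
            self.hits += 1
            self.pages.move_to_end(key)     # mark as most recently used
            return self.pages[key]
        self.misses += 1
        if not self.free:
            self._evict_lru()
        phys = self.free.pop()
        self.pages[key] = phys
        return phys

    def _evict_lru(self) -> None:
        """Evict the least recently used page that is not locked."""
        victim = next((k for k in self.pages if k not in self.locked), None)
        if victim is None:
            raise MemoryError("every resident page is locked")
        self.free.append(self.pages.pop(victim))
        self.evictions += 1
```

A hit rate for such a planner could then be computed as `hits / (hits + misses)`. Whether the project defines its KV cache hit rate this way is an assumption.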

Component Architecture Table

| Component | Role | Current Status | Future Plan |
| --- | --- | --- | --- |
| xl_persistent_megakernel | Integrated resident GPU control loop | Deterministic control flow stub | Real integrated inference pipeline |
| stage_prefill | Logical prefill stage | Metadata only | Real prefill attention |
| stage_decode | Logical decode stage | Deterministic token generation | Real decode kernel path |
| stage_spec_verify | Speculative validator | Deterministic accept/reject | Target model verification |
| stage_commit | Accept/submit stage | Metadata conversion | Integrated token/KV submission |
| stage_kv | KV lifecycle helper | Metadata only | Real paged KV movement |
| stage_scheduler | Device-side request selector | Linear scan + priority | GPU-resident scheduler |
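
The "linear scan + priority" status of `stage_scheduler` suggests a selector along the following lines. This is a sketch under assumptions: the `RequestSlot` fields, the convention that a larger `priority` value is more urgent, and the tie-break toward the lowest slot index are all illustrative choices, not the project's definitions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestSlot:
    """Hypothetical per-request descriptor scanned by the scheduler."""
    request_id: int
    priority: int   # assumption: larger value means more urgent
    ready: bool     # False while the request is finished or blocked


def select_next(slots: list[RequestSlot]) -> Optional[int]:
    """Scan all slots once and return the index of the best ready slot.

    Ties go to the lowest index because only a strictly higher
    priority replaces the current best. Returns None if nothing is ready.
    """
    best: Optional[int] = None
    for i, slot in enumerate(slots):
        if not slot.ready:
            continue
        if best is None or slot.priority > slots[best].priority:
            best = i
    return best
```

A linear scan costs O(n) per decision, which is simple to reason about in a stub. The table lists a GPU-resident scheduler as the planned replacement.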

Section 05

Benchmarking and Performance Analysis

Benchmark Modes

| Mode | Description |
| --- | --- |
| serial_decode | Block size 1, no speculation (CPU simulates host-initiated decoding) |
| speculative_decode | Configurable block size draft/verify/submit loop |
| forced_rejection | Forced periodic draft rejection with mismatched stride |
| kv_pressure | Eviction pressure triggered by insufficient KV cache size |
| mega_kernel_sim | Simulate integrated megakernel control path |
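
As a rough illustration of how the first three modes relate, the sketch below runs a deterministic draft/verify loop in which the draft can be forced to mismatch at a fixed stride. All function names, the toy `target_token` formula, and the `reject_stride` parameter are assumptions made for this example. The acceptance rule used here (keep the longest matching prefix, then take the verifier's token at the first mismatch) is the common greedy scheme, not necessarily the project's configurable strategy.

```python
def target_token(pos: int) -> int:
    """Stand-in for the target model: one deterministic token per position."""
    return (pos * 7 + 3) % 100


def draft_block(start: int, block_size: int, reject_stride: int) -> list[int]:
    """Propose a block of tokens; every reject_stride-th position is
    deliberately wrong (reject_stride=0 disables forced mismatches)."""
    block = []
    for pos in range(start, start + block_size):
        tok = target_token(pos)
        if reject_stride and (pos + 1) % reject_stride == 0:
            tok = (tok + 1) % 100  # forced mismatch
        block.append(tok)
    return block


def speculative_decode(num_tokens: int, block_size: int,
                       reject_stride: int = 0) -> tuple[list[int], float]:
    """Return the generated tokens and the draft acceptance rate."""
    tokens: list[int] = []
    proposed = accepted = 0
    while len(tokens) < num_tokens:
        start = len(tokens)
        draft = draft_block(start, block_size, reject_stride)
        proposed += len(draft)
        n_ok = 0
        for offset, tok in enumerate(draft):
            if tok != target_token(start + offset):
                break
            n_ok += 1
        accepted += n_ok
        tokens.extend(draft[:n_ok])          # commit the accepted prefix
        if n_ok < len(draft):
            # The rejected draft is released and the verifier's token is
            # committed, so every round makes progress.
            tokens.append(target_token(start + n_ok))
    return tokens[:num_tokens], accepted / proposed
```

With `block_size=1` and `reject_stride=0` this behaves like `serial_decode`. A larger block size corresponds to `speculative_decode`, and a non-zero stride approximates `forced_rejection`. In all three cases the output tokens are identical and only the acceptance rate and the number of rounds change.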

Key Performance Metrics

  • TTFT (Time To First Token)
  • ITL (Inter-Token Latency)
  • Speculative decoding acceptance rate
  • KV cache hit rate
  • Active/locked KV bytes
  • Memory fragmentation ratio
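
The two latency metrics can be derived from per-token timestamps. The helper below is a minimal sketch. The function name, its arguments, and the choice to report ITL as the mean gap between consecutive tokens are assumptions, since the article does not say how the benchmark framework aggregates these values.

```python
from statistics import mean


def latency_metrics(submit_time: float, token_times: list[float]) -> dict[str, float]:
    """Compute TTFT and mean ITL from timestamps in seconds.

    submit_time: when the request was submitted.
    token_times: when each output token became available, in order.
    """
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    ttft = token_times[0] - submit_time
    # Gaps between consecutive tokens; empty if only one token was produced.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return {"ttft": ttft, "mean_itl": mean(gaps) if gaps else 0.0}
```

For example, a request submitted at `0.0` with tokens at `0.5, 0.75, 1.0, 1.25` has a TTFT of 0.5 s and a mean ITL of 0.25 s.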

Section 06

Project Limitations and Future Plans

Current Limitations

The current CUDA stub does not perform real Transformer math and says nothing about model quality or production LLM throughput. It measures only the orchestration structure: host launch count, synchronization count, request lifecycle progress, and similar counters.
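
The kind of orchestration counting described here can be illustrated with a toy model. Everything in it is an assumption made for illustration: the `HostCounters` type, the `kernels_per_step` parameter, and the simplification that a host-driven loop synchronizes once per token while a persistent loop needs one launch and one final synchronization. It counts events only and measures no time.

```python
from dataclasses import dataclass


@dataclass
class HostCounters:
    """Host-side events only; no compute is modelled."""
    launches: int = 0
    syncs: int = 0


def host_driven(num_tokens: int, kernels_per_step: int) -> HostCounters:
    """The CPU launches every stage kernel for every token, then waits."""
    counters = HostCounters()
    for _ in range(num_tokens):
        counters.launches += kernels_per_step  # one launch per stage kernel
        counters.syncs += 1                    # wait before scheduling the next step
    return counters


def persistent(num_tokens: int) -> HostCounters:
    """The CPU launches one resident loop, which runs all steps on device."""
    counters = HostCounters(launches=1)
    for _ in range(num_tokens):
        pass  # stage transitions happen inside the resident loop
    counters.syncs = 1  # single sync to collect the finished request
    return counters
```

Under these assumptions, generating 100 tokens with 4 kernels per step costs 400 launches and 100 synchronizations in the host-driven model, against 1 launch and 1 synchronization in the persistent model. Real systems fall somewhere in between, for instance when the host batches work or uses asynchronous launches.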

To-Be-Implemented Features

  • Real CUDA attention/projection/sampling kernels
  • Integrated speculative verification kernel
  • Device-resident request descriptors and work queues
  • Multi-GPU/NVLink communication overlap
  • Continuous batching with dynamic request admission
  • Device-side real Transformer mathematical operations
  • Quantized weight and KV support
  • Memory-mapped model loading
Section 07

Practical Significance and Insights

XL-Persistent-Kernel points to a research direction for the future architecture of LLM inference services. Although it is currently a control flow stub, it illustrates how restructuring the CPU-GPU interaction model could yield performance improvements.

Value for LLM service infrastructure developers and researchers:

  1. New Architecture Perspective: Shift from CPU-centric to GPU-centric scheduling mode
  2. Scalable Code Framework: Modular design supports gradual replacement with real implementations
  3. Benchmarking Tool: Evaluate the effects of different optimization strategies
  4. Research Community Resource: Open-source code and documentation facilitate reproduction and expansion