Zing Forum

Lynn Engine: A Native LLM Inference Engine Built Exclusively for NVIDIA Blackwell

Lynn Engine is an LLM inference engine built from scratch, optimized for Lynn's own variable pruning MoE models and the NVFP4 format. The project aims to become a parallel mainline comparable to llama.cpp, enabling efficient inference on Blackwell architecture GPUs such as R6000/Spark.

Tags: LLM inference, NVIDIA Blackwell, NVFP4, quantization, CUDA, Triton, MoE, speculative decoding, llama.cpp, Qwen, memory optimization
Published 2026-06-04 00:13 | Recent activity 2026-06-04 00:21 | Estimated read 7 min

Section 02

Original Author and Source


Section 03

Project Background and Positioning

Lynn Engine is a native LLM inference engine designed specifically for the NVIDIA Blackwell architecture (sm_120/sm_121). Rather than building on an existing framework (such as vLLM, SGLang, TensorRT-LLM, or llama.cpp), Lynn Engine is written from scratch and focuses on Lynn's own variable pruning MoE (Mixture of Experts) models and NVIDIA's NVFP4 quantization format.

The project's strategic positioning changed significantly on June 3, 2026: Lynn Engine was repositioned from an R&D exploration path to a parallel mainline intended to be comparable to llama.cpp. In the short term, the client will still use llama.cpp/GGUF as its practical default backend, while the engine is developed in parallel with the goal of matching or exceeding llama.cpp's performance on the same model and hardware.


Section 04

1. Native NVFP4 Quantization Support

Lynn Engine's main differentiator is its native support for the NVFP4 (4-bit floating point) format. NVFP4 is a quantization format introduced with the NVIDIA Blackwell architecture: weights are stored as 4-bit E2M1 values, with a shared scale for each small block of values. It is expected to preserve accuracy better than traditional integer formats such as INT4 at the same bit width.

The project has implemented a complete NVFP4 inference pipeline:

  • W4A16 Quantization: Weights use 4-bit NVFP4, while activations remain in BF16
  • Custom CUDA/Triton Kernels: Handwritten kernels perform the matrix operations instead of relying on PyTorch's _scaled_mm
  • Zero-shadow Memory Optimization: A packed tensor layout reduces memory usage; the resident memory of a 35B model drops from 88 GiB to 28 GiB (a saving of about 60 GiB)
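
To make the W4A16 and packed-layout ideas above concrete, here is a minimal pure-Python sketch of NVFP4-style block quantization. It is not Lynn Engine code. The function names are hypothetical, and the block scale is kept as a plain Python float for readability, whereas real NVFP4 stores it as an FP8 value alongside a per-tensor scale.

```python
# Illustrative NVFP4-style block quantization (assumed names, not Lynn Engine code).
E2M1_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # magnitudes representable in FP4 E2M1
BLOCK = 16  # NVFP4 shares one scale across 16 consecutive values


def quantize_block(values):
    """Return (scale, codes); each code is a sign bit (8) OR-ed with a level index (0-7)."""
    amax = max(abs(v) for v in values)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the largest magnitude onto 6.0
    codes = []
    for v in values:
        mag = abs(v) / scale
        idx = min(range(8), key=lambda i: abs(E2M1_LEVELS[i] - mag))  # nearest level
        codes.append((8 if v < 0 else 0) | idx)
    return scale, codes


def pack(codes):
    """Pack two 4-bit codes per byte, low nibble first (expects an even count)."""
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def unpack(packed):
    """Inverse of pack(): split every byte back into two 4-bit codes."""
    codes = []
    for b in packed:
        codes.extend((b & 0x0F, b >> 4))
    return codes


def dequantize_block(scale, codes):
    """Rebuild approximate values, as a W4A16 kernel would before the matmul."""
    return [(-1.0 if c & 8 else 1.0) * E2M1_LEVELS[c & 7] * scale for c in codes]


def bits_per_weight(block=BLOCK, scale_bits=8):
    """4-bit payload plus one shared scale amortized over the block."""
    return 4 + scale_bits / block
```

With an 8-bit scale per 16 values, `bits_per_weight()` gives 4.5 bits per weight, compared with 16 for BF16. Keeping only the packed bytes and scales resident, without a higher-precision copy of the weights, is one plausible reading of "zero-shadow".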

Section 05

2. MoE (Mixture of Experts) Optimization

For MoE architecture models such as Qwen3.6-35B-A3B, Lynn Engine has implemented several key optimizations:

  • Active Expert Routing Optimization: Selects active experts via top-k routing to avoid computing all 30 experts
  • Grouped Native FP4 Kernel: Fuses the computation of multiple experts into a single kernel launch, reducing CUDA launch overhead
  • Shared Expert Fusion: Performs kernel fusion on shared experts to reduce dispatch overhead
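
The routing and grouping ideas in the list above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and renormalizing the weights over the selected experts is an assumption, not something the post states about Lynn Engine.

```python
import math
from collections import defaultdict


def route_top_k(router_logits, k):
    """Return [(expert_id, weight), ...] for the k highest-scoring experts of one token."""
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    norm = sum(probs[e] for e in top)  # assumption: renormalize over the selected experts
    return [(e, probs[e] / norm) for e in top]


def group_by_expert(batch_logits, k):
    """Bucket (token_index, weight) pairs per expert so each active expert runs once."""
    groups = defaultdict(list)
    for t, logits in enumerate(batch_logits):
        for e, w in route_top_k(logits, k):
            groups[e].append((t, w))
    return dict(groups)
```

Experts that never appear in the result are skipped entirely. A grouped kernel would then process all buckets in a single launch instead of one launch per expert.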

In testing on the R6000 (sm_120a), the 27B model reaches 107-108 TPS (tokens per second) on the strict default path, and up to 123.78 TPS in serving replay mode.

Section 06

3. Speculative Decoding

The project is implementing Nemotron-style self-speculative decoding:

  • APEX-MTP Support: Integrates the official APEX/MTP sidecar to implement K=2 verify/accept/crop/full-accept/prefix-repair
  • Token-exact Verification: Checks that speculative decoding produces exactly the same tokens as standard decoding
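
The verify/accept/crop/full-accept flow listed above can be illustrated with a greedy toy version. This is a sketch under stated assumptions, not the APEX/MTP implementation: `target_next` and `draft_next` are hypothetical next-token functions standing in for the main model and the draft head, and verification is done position by position rather than in one batched forward pass.

```python
def speculative_step(target_next, draft_next, context, k=2):
    """Run one draft/verify round and return the tokens to append to context."""
    # Draft: propose k tokens autoregressively with the cheap head.
    draft = []
    for _ in range(k):
        draft.append(draft_next(context + draft))

    # Verify: compare each drafted token with what the target would emit there.
    accepted = []
    for tok in draft:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # repair with the target's token, crop the rest
            return accepted
        accepted.append(tok)

    # Full accept: the target also contributes one extra token after the draft.
    accepted.append(target_next(context + accepted))
    return accepted


def generate(target_next, draft_next, prompt, max_new, k=2):
    """Decode max_new tokens using repeated speculative rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target_next, draft_next, out, k))
    return out[: len(prompt) + max_new]
```

Because every emitted token is one the target itself would have chosen, the output matches plain greedy decoding token for token. The draft only changes how many tokens are produced per round.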

On the Spark (sm_121), using the Qwen3.6-35B-A3B APEX-MTP I-Balanced configuration, single-stream throughput reaches 77.01 tok/s, a 27% improvement over the 60.65 tok/s of autoregressive (AR) decoding.


Section 07

35B Model Side-by-Side Comparison (Spark sm_121 GB10, Single Stream)

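| Path                          | Model Size | Single-Stream TPS | MMLU (500) | GPQA Diamond (198) |
|-------------------------------|------------|-------------------|------------|--------------------|
| Lynn-native NVFP4 W4A16       | 23 GB      | 38.96 → ~45       | 84.40%     | 49.49%             |
| llama.cpp Q4_K_M-imatrix      | 20 GB      | 69.77             | 83.00%     | 50.00%             |
| llama.cpp APEX-MTP I-Balanced | 25 GB      | 77.01             | 90.00%     | 78.79%             |
| SGLang BF16 official          | 67 GB      | 30.14             | 86.40%     | 45.45%             |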

Key Findings:

  • Under NVFP4 quantization, Lynn Engine's GPQA score (49.49%) is on par with Q4_K_M-imatrix (50.00%) and within a few points of BF16 (45.45%), so the expected "NVFP4 quality advantage" does not show up in this comparison
  • The throughput gap with llama.cpp is attributed mainly to the maturity of its CUDA kernels and dispatch optimizations, not to the quantization format itself
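
As a rough sanity check on how much weight small GPQA differences can bear, the snippet below estimates the sampling noise of an accuracy measured on 198 questions. It assumes each question is an independent trial (a simple binomial model), which is a simplification. The accuracies are copied from the comparison table above.

```python
import math


def binomial_std_error(accuracy, n):
    """Standard error of an accuracy measured on n independent questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


# GPQA Diamond accuracies from the comparison table (198 questions each).
gpqa = {
    "Lynn-native NVFP4 W4A16": 0.4949,
    "llama.cpp Q4_K_M-imatrix": 0.5000,
    "SGLang BF16 official": 0.4545,
}

for name, acc in gpqa.items():
    se_pp = 100.0 * binomial_std_error(acc, 198)
    correct = round(acc * 198)
    print(f"{name}: {100 * acc:.2f}% +/- {se_pp:.1f} pp (1 sigma), ~{correct} correct")
```

Under this model one standard error is about 3.5 percentage points, so the three scores are hard to tell apart on this benchmark alone.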

Section 08

9B Default Shipping Candidate Model

For regular users, Lynn recommends Qwen3.5-9B Q4_K_M-imatrix (5.3 GB) as the default local model:

  • MMLU 100 thinking-on excl_pf: 90.00%
  • GPQA Diamond 198: 81.71%
  • Spark sm_121 single stream TPS: 36.80
  • Total TPS with c=8 concurrency: 177.54
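
The two throughput figures above imply a scaling factor for concurrent serving. The short calculation below derives it. The variable names are illustrative, and the per-stream figure assumes the eight streams share throughput evenly.

```python
single_stream_tps = 36.80  # Spark sm_121, one stream (from the list above)
aggregate_tps_c8 = 177.54  # total across 8 concurrent streams
concurrency = 8

speedup = aggregate_tps_c8 / single_stream_tps   # aggregate gain over one stream
per_stream_tps = aggregate_tps_c8 / concurrency  # average rate each stream sees
efficiency = speedup / concurrency               # fraction of ideal linear scaling

print(f"aggregate speedup: {speedup:.2f}x")
print(f"per-stream TPS at c=8: {per_stream_tps:.2f}")
print(f"scaling efficiency: {efficiency:.0%}")
```

Batching eight streams raises total throughput to roughly 4.8 times the single-stream rate, while each individual stream runs at about 22 TPS.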