# Lynn Engine: A Native LLM Inference Engine Built Exclusively for NVIDIA Blackwell

> Lynn Engine is an LLM inference engine written from scratch and optimized for Lynn's own variable-pruning MoE models and the NVFP4 format. The project aims to become a parallel mainline on par with llama.cpp, delivering efficient inference on Blackwell-architecture GPUs such as the R6000 and Spark.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-03T16:13:02.000Z
- Last activity: 2026-06-03T16:21:56.509Z
- Popularity: 163.8
- Keywords: LLM inference, NVIDIA Blackwell, NVFP4 quantization, CUDA, Triton, MoE, speculative decoding, llama.cpp, Qwen, memory optimization
- Page link: https://www.zingnex.cn/en/forum/thread/lynn-engine-nvidia-blackwell-llm
- Canonical: https://www.zingnex.cn/forum/thread/lynn-engine-nvidia-blackwell-llm

---

## Original Author and Source

- **Original Author/Maintainer**: MerkyorLynn
- **Source Platform**: GitHub
- **Original Project Name**: lynn-engine
- **Original Link**: https://github.com/MerkyorLynn/lynn-engine
- **Release Date**: 2026-06-03

---

## Project Background and Positioning

Lynn Engine is a native LLM inference engine designed specifically for the NVIDIA Blackwell architecture (sm_120/sm_121). Rather than building on an existing framework such as vLLM, SGLang, TensorRT-LLM, or llama.cpp, it is written from scratch and focuses on Lynn's own variable-pruning MoE (Mixture of Experts) models and the NVFP4 quantization format.

The project's strategic positioning changed significantly on June 3, 2026: Lynn Engine was repositioned from an R&D exploration path to a **parallel mainline** intended to stand alongside llama.cpp. In the short term, the client will keep llama.cpp/GGUF as its practical default backend, while the engine is developed in parallel with the goal of matching or exceeding llama.cpp's performance on the same model and hardware.

---

## 1. Native NVFP4 Quantization Support

Lynn Engine's main differentiator is native support for NVFP4, a 4-bit floating-point quantization format introduced with the NVIDIA Blackwell architecture. NVFP4 is expected to behave better numerically than traditional integer formats such as INT4, although the benchmark results later in this post show no clear quality advantage so far.

The project has implemented a complete NVFP4 inference pipeline:
- **W4A16 Quantization**: Weights are stored in 4-bit NVFP4, while activations remain in BF16
- **Custom CUDA/Triton Kernels**: Matrix operations run on handwritten kernels instead of relying on PyTorch's `_scaled_mm`
- **Zero-shadow Memory Optimization**: A packed tensor layout reduces memory usage; the resident memory of a 35B model drops from 88 GiB to 28 GiB (a saving of approximately 60 GiB)
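
To make the W4A16 idea concrete, here is a minimal pure-Python sketch of block-scaled 4-bit floating-point storage. It is not Lynn Engine's kernel code. It assumes the usual E2M1 value set, a block size of 16 elements per scale, and low-nibble-first packing; the function names are hypothetical, and the block scale is kept as a plain Python float rather than the FP8 block scale plus per-tensor scale that real NVFP4 tensors carry.

```python
# Illustrative sketch only; names, nibble order, and rounding are assumptions.
# Real NVFP4 stores block scales in FP8 and adds a per-tensor scale; a Python
# float stands in for both here.

E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # 3 low bits
BLOCK = 16  # elements that share one scale


def decode_fp4(code):
    """Decode a 4-bit code: bit 3 is the sign, bits 0-2 index the magnitude."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def encode_fp4(x):
    """Round x to the nearest E2M1 value (ties go to the smaller magnitude)."""
    mag = abs(x)
    idx = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - mag))
    return idx | (0x8 if x < 0 else 0)


def quantize_block(values):
    """Quantize 16 floats into (scale, 8 packed bytes)."""
    assert len(values) == BLOCK
    amax = max(abs(v) for v in values)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the largest value to +/-6
    codes = [encode_fp4(v / scale) for v in values]
    # Two 4-bit codes per byte, low nibble first.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, BLOCK, 2))
    return scale, packed


def dequantize_block(scale, packed):
    """Expand 8 packed bytes back into 16 floats for use with BF16 activations."""
    out = []
    for byte in packed:
        out.append(decode_fp4(byte & 0xF) * scale)
        out.append(decode_fp4(byte >> 4) * scale)
    return out


def packed_weight_bytes(n_params):
    """Rough footprint: 4 bits per weight plus one 1-byte scale per block."""
    return n_params // 2 + n_params // BLOCK
```

In a W4A16 matmul the dequantized weights would be multiplied against BF16 activations inside the kernel, so only the packed bytes and scales need to stay resident. Under these assumptions, `packed_weight_bytes(35_000_000_000)` gives roughly 19.7 GB, before counting any tensors kept in higher precision.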

## 2. MoE (Mixture of Experts) Optimization

For MoE architecture models such as Qwen3.6-35B-A3B, Lynn Engine has implemented several key optimizations:

- **Active Expert Routing Optimization**: Selects active experts via top-k routing to avoid computing all 30 experts
- **Grouped Native FP4 Kernel**: Fuses the computation of multiple experts into a single kernel launch, reducing CUDA launch overhead
- **Shared Expert Fusion**: Performs kernel fusion on shared experts to reduce dispatch overhead
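
The routing and grouping steps above can be sketched in a few lines of Python. This is an illustration of the general technique, not Lynn Engine's implementation: the function names are hypothetical, and it assumes softmax weights renormalized over the selected top-k experts. Grouping tokens by expert is the CPU-side analogue of what a grouped kernel does in one launch.

```python
import math
from collections import defaultdict


def top_k_route(logits, k):
    """Pick the k highest-scoring experts for one token.

    Returns [(expert_id, weight)], with weights softmax-normalized over the
    selected experts only (an assumption; routers differ on this detail).
    """
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    m = max(logits[e] for e in top)  # subtract the max for numerical stability
    exps = [math.exp(logits[e] - m) for e in top]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(top, exps)]


def group_by_expert(router_logits, k):
    """Map expert_id -> [(token_index, weight)].

    Each expert then processes all of its tokens as one batch, and experts
    that received no tokens are never computed.
    """
    groups = defaultdict(list)
    for token_index, logits in enumerate(router_logits):
        for expert_id, weight in top_k_route(logits, k):
            groups[expert_id].append((token_index, weight))
    return dict(groups)
```

For example, `group_by_expert([[0.0, 2.0, 1.0, -1.0], [3.0, 0.0, 0.0, 3.0]], k=2)` sends token 0 to experts 1 and 2 and token 1 to experts 0 and 3. With more experts than selected slots, any expert absent from the returned dictionary is skipped entirely.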

In measured runs on the R6000 (sm_120a), the 27B model reaches **107-108 TPS** (tokens per second) on the strict default path and up to **123.78 TPS** in serving replay mode.

## 3. Speculative Decoding

The project is implementing Nemotron-style self-speculative decoding:
- **APEX-MTP Support**: Integrates the official APEX/MTP sidecar to implement K=2 verify/accept/crop/full-accept/prefix-repair
- **Token-exact Verification**: Ensures the numerical correctness of speculative decoding

On the Spark (sm_121), using the Qwen3.6-35B-A3B APEX-MTP I-Balanced configuration, the single-stream performance reaches **77.01 tok/s**, which is a **27% improvement** compared to the 60.65 tok/s of the autoregressive (AR) mode.
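
The verify/accept/crop cycle can be illustrated with a small greedy, token-exact check. This is a generic sketch of speculative verification, not the APEX-MTP sidecar's API: `verify_draft` and its argument layout are hypothetical, and it assumes the target model has already produced its own greedy token at each of the K drafted positions plus one extra position.

```python
def verify_draft(draft_tokens, target_tokens):
    """Token-exact verification of K drafted tokens.

    draft_tokens:  K tokens proposed by the draft head.
    target_tokens: K + 1 greedy tokens from the target model, where
                   target_tokens[i] is the target's choice at position i
                   given the prefix plus draft_tokens[:i].
    Returns the tokens to append to the output this step (1 to K + 1 tokens).
    """
    assert len(target_tokens) == len(draft_tokens) + 1
    accepted = []
    for drafted, expected in zip(draft_tokens, target_tokens):
        if drafted != expected:
            # First mismatch: keep the target's token and crop the rest of
            # the draft, so the output matches plain autoregressive decoding.
            accepted.append(expected)
            return accepted
        accepted.append(drafted)
    # Full accept: every drafted token matched, so the target's extra token
    # comes for free.
    accepted.append(target_tokens[-1])
    return accepted
```

With K=2, `verify_draft([5, 7], [5, 7, 9])` returns `[5, 7, 9]` (full accept), while `verify_draft([5, 7], [5, 8, 9])` returns `[5, 8]` (one token accepted, one corrected). Because every step emits at least one token that the target model itself would have chosen, the output stays identical to AR decoding and the speedup depends only on how often drafts are accepted.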

---

## 35B Model Side-by-Side Comparison (Spark sm_121 GB10, Single Stream)

| Path | Model Size | Single Stream TPS | MMLU 500 | GPQA Diamond 198 |
|------|---------|---------|---------|-----------------|
| Lynn-native NVFP4 W4A16 | 23 GB | **38.96 → ~45** | 84.40% | 49.49% |
| llama.cpp Q4_K_M-imatrix | 20 GB | **69.77** | 83.00% | **50.00%** |
| llama.cpp APEX-MTP I-Balanced | 25 GB | **77.01** | **90.00%** | **78.79%** |
| SGLang BF16 official | 67 GB | 30.14 | 86.40% | 45.45% |

Key Findings:
- Under NVFP4 quantization, Lynn Engine scores 49.49% on GPQA Diamond, on par with llama.cpp Q4_K_M-imatrix (50.00%) and about 4 points above SGLang BF16 (45.45%), so the expected "NVFP4 quality advantage" did not show up in this test
- The throughput gap with llama.cpp comes mainly from the maturity of its CUDA kernels and dispatch optimizations, not from the quantization format itself
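
To put the throughput gap in numbers, the single-stream figures from the table can be expressed relative to the Lynn-native path. The helper below is a hypothetical convenience function; it uses the measured 38.96 TPS as the baseline rather than the projected ~45.

```python
# Single-stream TPS values copied from the comparison table above.
SINGLE_STREAM_TPS = {
    "Lynn-native NVFP4 W4A16": 38.96,  # measured value, not the ~45 target
    "llama.cpp Q4_K_M-imatrix": 69.77,
    "llama.cpp APEX-MTP I-Balanced": 77.01,
    "SGLang BF16 official": 30.14,
}


def relative_throughput(tps_by_path, baseline):
    """Return each path's throughput as a multiple of the baseline path."""
    base = tps_by_path[baseline]
    return {path: tps / base for path, tps in tps_by_path.items()}


if __name__ == "__main__":
    ratios = relative_throughput(SINGLE_STREAM_TPS, "Lynn-native NVFP4 W4A16")
    for path, ratio in ratios.items():
        print(f"{path}: {ratio:.2f}x")
```

On these figures llama.cpp Q4_K_M-imatrix runs at about 1.79x the Lynn-native path and the APEX-MTP configuration at about 1.98x, while SGLang BF16 is at about 0.77x.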

## 9B Default Shipping Candidate Model

For regular users, Lynn recommends Qwen3.5-9B Q4_K_M-imatrix (5.3GB) as the default local model:
- MMLU 100 thinking-on excl_pf: **90.00%**
- GPQA Diamond 198: **81.71%**
- Spark sm_121 single stream TPS: **36.80**
- Total TPS with c=8 concurrency: **177.54**
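
The two throughput figures above imply how well the 9B model scales under concurrency. The helper below is a hypothetical sketch of that arithmetic, assuming "Total TPS" is the aggregate across all 8 streams.

```python
def concurrency_summary(single_stream_tps, total_tps, concurrency):
    """Derive per-stream speed, aggregate scaling, and scaling efficiency."""
    per_stream = total_tps / concurrency       # tokens/s seen by each stream
    scaling = total_tps / single_stream_tps    # aggregate gain over one stream
    efficiency = scaling / concurrency         # 1.0 would be perfect scaling
    return per_stream, scaling, efficiency


if __name__ == "__main__":
    per_stream, scaling, efficiency = concurrency_summary(36.80, 177.54, 8)
    print(f"per stream: {per_stream:.2f} tok/s")
    print(f"aggregate scaling: {scaling:.2f}x")
    print(f"efficiency: {efficiency:.0%}")
```

With the reported numbers, each of the 8 streams gets about 22.19 tok/s, and aggregate throughput is about 4.82x the single-stream rate, or roughly 60% of ideal linear scaling.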

---
