Lynn Engine: A Native LLM Inference Engine Built Exclusively for NVIDIA Blackwell

Lynn Engine is an LLM inference engine built from scratch, optimized for Lynn's own variable pruning MoE models and the NVFP4 format. The project aims to become a parallel mainline comparable to llama.cpp, enabling efficient inference on Blackwell architecture GPUs such as R6000/Spark.

LLM Inference, NVIDIA Blackwell, NVFP4 Quantization, CUDA, Triton, MoE, Speculative Decoding, llama.cpp, Qwen, Memory Optimization

Published 2026-06-04 00:13 | Recent activity 2026-06-04 00:21 | Estimated read 7 min

Section 01

Introduction: Lynn Engine, a Native LLM Inference Engine Built Exclusively for NVIDIA Blackwell

Section 02

Original Author and Source

Original Author/Maintainer: MerkyorLynn
Source Platform: GitHub
Original Project Name: lynn-engine
Original Link: https://github.com/MerkyorLynn/lynn-engine
Release Date: 2026-06-03

Section 03

Project Background and Positioning

Lynn Engine is a native LLM inference engine designed specifically for the NVIDIA Blackwell architecture (sm_120/sm_121). Rather than building on an existing framework (such as vLLM, SGLang, TensorRT-LLM, or llama.cpp), Lynn Engine is written from scratch and focuses on Lynn's own variable pruning MoE (Mixture of Experts) models and the NVFP4 quantization format.

The project's strategic positioning has undergone a significant adjustment: on June 3, 2026, Lynn Engine was repositioned as a parallel mainline aiming to be comparable to llama.cpp, instead of being just an R&D exploration path as previously planned. In the short term, the client will still use llama.cpp/GGUF as the practical default backend, but the engine will be developed in parallel with the goal of matching or exceeding llama.cpp's performance under the same model and hardware conditions.

Section 04

1. Native NVFP4 Quantization Support

Lynn Engine's main differentiator is its native support for the NVFP4 (4-bit floating point) format. NVFP4 is a quantization format introduced with the NVIDIA Blackwell architecture and is generally expected to offer better numerical behavior than traditional INT4/INT8 quantization.

The project has implemented a complete NVFP4 inference pipeline:

W4A16 Quantization: Weights use 4-bit NVFP4, while activations remain in BF16
Custom CUDA/Triton Kernels: Instead of relying on PyTorch's _scaled_mm, hand-written kernels perform the matrix operations
Zero-shadow Memory Optimization: Reduces memory usage through packed tensor layout; the resident memory of a 35B model is reduced from 88GiB to 28GiB (saving approximately 60GiB)
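
The W4A16 weight path and the packed layout described above can be illustrated with a small pure-Python sketch. It assumes the publicly described NVFP4 scheme (E2M1 values, one scale per block of 16 weights) and simplifies it: the scale is kept as a plain Python float instead of an FP8 value, and the per-tensor scale is omitted. All function names are hypothetical, and this is not Lynn Engine's actual kernel or memory layout.

```python
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # assumed block size: one scale shared by 16 weights


def quantize_block(block):
    """Quantize one block of floats to (scale, list of 4-bit codes)."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude onto 6.0, the largest E2M1 value.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for x in block:
        mag = abs(x) / scale
        idx = min(range(8), key=lambda i: abs(E2M1_VALUES[i] - mag))
        codes.append((0x8 if x < 0 else 0x0) | idx)  # bit 3 is the sign
    return scale, codes


def pack_nibbles(codes):
    """Pack two 4-bit codes per byte, low nibble first."""
    if len(codes) % 2:
        codes = codes + [0]
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def unpack_nibbles(packed, n):
    """Inverse of pack_nibbles; returns the first n codes."""
    out = []
    for b in packed:
        out.append(b & 0xF)
        out.append(b >> 4)
    return out[:n]


def dequantize_block(scale, codes):
    """Expand 4-bit codes back to floats (the 'A16' side would consume these)."""
    return [(-1.0 if c & 0x8 else 1.0) * E2M1_VALUES[c & 0x7] * scale for c in codes]


def roundtrip(weights):
    """Quantize, pack, unpack and dequantize a flat weight list block by block."""
    packed, scales, restored = b"", [], []
    for start in range(0, len(weights), BLOCK):
        block = weights[start:start + BLOCK]
        scale, codes = quantize_block(block)
        chunk = pack_nibbles(codes)
        packed += chunk
        scales.append(scale)
        restored += dequantize_block(scale, unpack_nibbles(chunk, len(block)))
    return packed, scales, restored
```

In this layout, 16 weights occupy 8 bytes of packed codes plus one scale. With a one-byte scale, as in NVFP4, that is about 4.5 bits per weight, compared with 16 bits per weight in BF16.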

Section 05

2. MoE (Mixture of Experts) Optimization

For MoE architecture models such as Qwen3.6-35B-A3B, Lynn Engine has implemented several key optimizations:

Active Expert Routing Optimization: Selects active experts via top-k routing to avoid computing all 30 experts
Grouped Native FP4 Kernel: Fuses the computation of multiple experts into a single kernel launch, reducing CUDA launch overhead
Shared Expert Fusion: Performs kernel fusion on shared experts to reduce dispatch overhead
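
Top-k routing and the grouping step that precedes a grouped kernel launch can be sketched in plain Python. This is a minimal illustration under assumptions: the router weights are a softmax renormalized over the selected experts (one common variant), and the names `top_k_route` and `group_by_expert` are hypothetical, not taken from Lynn Engine.

```python
import math
from collections import defaultdict


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts for one token and normalize their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda e: router_logits[e], reverse=True)[:k]
    m = max(router_logits[e] for e in top)  # subtract the max for numerical stability
    exps = [math.exp(router_logits[e] - m) for e in top]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(top, exps)]


def group_by_expert(batch_logits, k):
    """Bucket (token index, weight) pairs per expert.

    Only experts that received at least one token appear in the result, so
    inactive experts are never computed, and each active expert can process
    all of its tokens in one batched call.
    """
    groups = defaultdict(list)
    for token, logits in enumerate(batch_logits):
        for expert, weight in top_k_route(logits, k):
            groups[expert].append((token, weight))
    return dict(groups)
```

A grouped kernel would take these buckets as its work list, handling every active expert in a single launch rather than one launch per expert.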

Benchmarks on the R6000 (sm_120a) show the 27B model reaching 107-108 TPS (tokens per second) on the strict default path, and up to 123.78 TPS in serving replay mode.

Section 06

3. Speculative Decoding

The project is implementing Nemotron-style self-speculative decoding:

APEX-MTP Support: Integrates the official APEX/MTP sidecar to implement K=2 verify/accept/crop/full-accept/prefix-repair
Token-exact Verification: Ensures the numerical correctness of speculative decoding

On the Spark (sm_121), using the Qwen3.6-35B-A3B APEX-MTP I-Balanced configuration, the single-stream performance reaches 77.01 tok/s, which is a 27% improvement compared to the 60.65 tok/s of the autoregressive (AR) mode.
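
The verify/accept logic behind token-exact speculative decoding can be sketched as follows. This is a simplified greedy version under assumptions: `target_next` and `draft_next` are hypothetical callables that return the next token for a token list, the draft proposes K tokens per round, and rejected drafts are simply discarded. It does not model the KV-cache crop or prefix-repair steps named above, and it is not the APEX/MTP implementation.

```python
def speculative_step(target_next, draft_next, context, k=2):
    """Run one draft/verify round and return the tokens to append."""
    # Draft phase: the cheap model proposes k tokens.
    draft = []
    for _ in range(k):
        draft.append(draft_next(context + draft))

    # Verify phase: compare each draft token with the target's greedy choice.
    # A real engine would score all positions in one batched target pass.
    accepted = []
    for tok in draft:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch, drop the rest
            return accepted
        accepted.append(tok)

    # Full accept: the target pass also yields one extra token for free.
    accepted.append(target_next(context + accepted))
    return accepted


def generate(target_next, draft_next, prompt, n_tokens, k=2):
    """Generate n_tokens; return (tokens, number of verify rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target_next, draft_next, out, k))
        rounds += 1
    return out[:len(prompt) + n_tokens], rounds
```

Because every emitted token is either confirmed or produced by the target model, the output matches plain greedy decoding exactly. The speedup comes from needing fewer target rounds than tokens whenever drafts are accepted.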

Section 07

35B Model Horizontal Comparison (Spark sm_121 GB10 Single Stream)

Path	Model Size	Single Stream TPS	MMLU 500	GPQA Diamond 198
Lynn-native NVFP4 W4A16	23 GB	38.96 → ~45	84.40%	49.49%
llama.cpp Q4_K_M-imatrix	20 GB	69.77	83.00%	50.00%
llama.cpp APEX-MTP I-Balanced	25 GB	77.01	90.00%	78.79%
SGLang BF16 official	67 GB	30.14	86.40%	45.45%

Key Findings:

Under NVFP4 quantization, Lynn Engine's GPQA score (49.49%) is roughly on par with llama.cpp Q4_K_M-imatrix (50.00%) and within a few points of SGLang BF16 (45.45%), so the expected "NVFP4 quality advantage" does not show up in this benchmark
The gap with llama.cpp mainly comes from the maturity of CUDA kernels and dispatch optimizations, not the quantization format itself

Section 08

9B Default Shipping Candidate Model

For regular users, Lynn recommends Qwen3.5-9B Q4_K_M-imatrix (5.3GB) as the default local model:

MMLU 100 thinking-on excl_pf: 90.00%
GPQA Diamond 198: 81.71%
Spark sm_121 single stream TPS: 36.80
Total TPS with c=8 concurrency: 177.54
