AMD-NFS: A Native LLM Inference Stack to Break CUDA Monopoly

AMD-NFS is an LLM inference and serving stack built from scratch. It is designed to bypass CUDA ecosystem lock-in, natively support ROCm/HIP, and replace established serving software such as vLLM and llama.cpp.

Tags: AMD, ROCm, HIP, LLM inference, CUDA alternative, GPU computing, open source, AI inference optimization

Published 2026-04-24 14:43 | Recent activity 2026-04-24 14:50 | Estimated read 5 min

Section 01

Introduction: AMD-NFS: A Native LLM Inference Stack to Break CUDA Monopoly

Section 02

Background: The Monopoly Dilemma of the CUDA Ecosystem

The current large language model (LLM) inference ecosystem is dominated by NVIDIA's CUDA. From vLLM to llama.cpp, from Triton Inference Server to various optimization frameworks, most open-source projects prioritize the CUDA platform, and some support nothing else. This lock-in limits hardware choice and has long kept GPUs from competitors such as AMD at the margins of AI inference.

For developers using AMD GPUs, this means either giving up performance optimizations or struggling with compatibility layers. ROCm, AMD's open-source GPU computing platform, provides HIP (Heterogeneous-compute Interface for Portability), a C++ runtime API and kernel language that closely mirrors CUDA's interfaces so that CUDA code can be ported. Even so, most existing software stacks have not been deeply optimized for AMD hardware.

Section 03

Project Overview: The Vision of AMD's Native Inference Stack

AMD-NFS (AMD-Native Inference Stack) was born to address this pain point. It is an LLM inference and service stack built from scratch, whose core goal is to completely bypass CUDA ecosystem lock-in, natively support AMD's ROCm/HIP platform, and provide a unified, high-performance alternative.

Rather than adding HIP compatibility layers on top of existing CUDA code, AMD-NFS takes a more ambitious path: redesigning the entire inference stack so that it is optimized for AMD GPU architecture from the ground up. This covers every level of the stack, including memory management, kernel scheduling, and parallel execution patterns.

Section 04

Technical Architecture: A Modular Stack with Layered Design

AMD-NFS uses a layered architecture that divides the system into four independent but cooperating layers, each written in a different language:

Section 05

C Language Bottom Layer: Memory and Kernel Management

The bottom layer is implemented in C and includes slab allocators and HIP kernel stubs. A slab allocator is a memory management technique that pre-allocates fixed-size blocks and recycles them through a free list, which reduces runtime allocation overhead. This matters for LLM inference, where buffers are allocated and released frequently. The HIP kernel stubs are placeholder entry points that define the basic interface for the GPU computations to be implemented later.
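To make the slab idea concrete, here is a minimal sketch of a fixed-size block pool. It is not taken from AMD-NFS: the project's allocator is described as C code, while this illustration is written in C++ and manages host memory only. The class name `SlabPool` and its methods are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical fixed-size slab pool (illustration only, not AMD-NFS code).
// One upfront malloc is carved into equal blocks; free blocks are linked
// through their own storage, so alloc() and release() are O(1).
class SlabPool {
public:
    SlabPool(std::size_t block_size, std::size_t block_count)
        : block_size_(round_up(block_size)) {
        assert(block_count > 0);
        base_ = static_cast<unsigned char*>(
            std::malloc(block_size_ * block_count));
        if (base_ == nullptr) return;  // pool stays empty; alloc() yields nullptr
        // Thread every block onto the free list.
        for (std::size_t i = 0; i < block_count; ++i) {
            push(base_ + i * block_size_);
        }
    }
    ~SlabPool() { std::free(base_); }
    SlabPool(const SlabPool&) = delete;
    SlabPool& operator=(const SlabPool&) = delete;

    // Pop one block from the free list, or return nullptr when exhausted.
    void* alloc() {
        if (free_head_ == nullptr) return nullptr;
        FreeNode* node = free_head_;
        free_head_ = node->next;
        --free_count_;
        return node;
    }

    // Return a block to the pool. The sketch does not verify that p
    // actually came from this pool.
    void release(void* p) {
        if (p != nullptr) push(p);
    }

    std::size_t block_size() const { return block_size_; }
    std::size_t free_blocks() const { return free_count_; }

private:
    struct FreeNode { FreeNode* next; };

    // Blocks must hold a FreeNode and keep malloc's alignment guarantee.
    static std::size_t round_up(std::size_t n) {
        const std::size_t a = alignof(std::max_align_t);
        if (n < sizeof(FreeNode)) n = sizeof(FreeNode);
        return (n + a - 1) / a * a;
    }

    void push(void* p) {
        free_head_ = new (p) FreeNode{free_head_};
        ++free_count_;
    }

    std::size_t block_size_;
    unsigned char* base_ = nullptr;
    FreeNode* free_head_ = nullptr;
    std::size_t free_count_ = 0;
};
```

A caller would construct one pool per buffer size, for example `SlabPool pool(4096, 256);`, then pair each `pool.alloc()` with a `pool.release(ptr)`. Because the free list is last-in, first-out, a block that was just released is the next one handed out, which tends to keep recently used memory warm in cache.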

Section 06

C++ Engine Core

The middle layer uses C++ to build the engine core skeleton, responsible for key functions such as model loading, inference scheduling, and batch processing management. C++'s performance advantages and fine-grained control over hardware make it an ideal choice for building high-performance inference engines.
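As a rough illustration of what batch processing management can mean in an engine core, the sketch below groups pending requests into batches of bounded size. It is an assumption about the general technique, not the AMD-NFS scheduler: the `Request` struct, the `BatchQueue` class, and the `max_batch` parameter are all hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical inference request (illustration only).
struct Request {
    int id;
    std::string prompt;
};

// Hypothetical FIFO batcher: requests are queued on arrival and handed
// to the engine in groups of at most max_batch, preserving arrival order.
class BatchQueue {
public:
    explicit BatchQueue(std::size_t max_batch) : max_batch_(max_batch) {
        assert(max_batch > 0);
    }

    // Enqueue a request for a later batch.
    void submit(Request r) { pending_.push_back(std::move(r)); }

    // Remove and return up to max_batch requests; empty if none are pending.
    std::vector<Request> next_batch() {
        std::vector<Request> batch;
        while (!pending_.empty() && batch.size() < max_batch_) {
            batch.push_back(std::move(pending_.front()));
            pending_.pop_front();
        }
        return batch;
    }

    std::size_t pending() const { return pending_.size(); }

private:
    std::size_t max_batch_;
    std::deque<Request> pending_;
};
```

An engine loop would call `next_batch()` repeatedly and run one forward pass per returned batch. A production scheduler would typically also account for sequence lengths, memory limits, and timeouts; none of that is modeled here.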

Section 07

Python Binding Layer

Python bindings are provided via Cython, allowing developers to use familiar Python interfaces to call underlying high-performance implementations. This layer also includes setup.py for easy installation and deployment, lowering the barrier to use.

Section 08

Go Language Service Layer

The top layer uses Go to build the server skeleton, leveraging Go's advantages in concurrent processing and network services to provide high-throughput model service interfaces. Go's lightweight goroutine model is particularly suitable for handling a large number of concurrent inference requests.
