Zing Forum


VDCores: A Resource-Decoupled Programming and Execution Model for Asynchronous GPUs

This article introduces VDCores, a decoupled programming model designed for the asynchronous hardware features of modern GPUs. By representing workloads as dependency-connected micro-operations and automatically scheduling overlapping memory operations and computations, it significantly improves LLM inference throughput while greatly reducing kernel programming complexity.

Tags: GPU programming · asynchronous execution · resource decoupling · LLM inference optimization · micro-operations · virtual cores · compute architecture
Published 2026-05-05 06:17 · Recent activity 2026-05-06 11:50 · Estimated read: 4 min

Section 01

Introduction: VDCores, a Resource-Decoupled Programming Model for Asynchronous GPUs

This article introduces VDCores, a decoupled programming model designed for the asynchronous hardware features of modern GPUs. By representing workloads as dependency-connected micro-operations and automatically scheduling overlapping memory operations and computations, it addresses the mismatch between traditional monolithic kernel programming models and GPU heterogeneous hardware, significantly improving LLM inference throughput while greatly reducing kernel programming complexity.


Section 02

Background: Mismatch Between Traditional GPU Programming Models and Hardware Architecture

Modern GPUs are equipped with a variety of asynchronous hardware units (e.g., copy engines and tensor cores), but the traditional CUDA programming model treats the monolithic kernel as its unit of execution, baking in assumptions of synchronous execution and static scheduling. This makes cross-unit parallelism hard to achieve, serializes memory transfers and computation, and wastes hardware resources.
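The cost of this serialization can be seen with a toy timing model. This is not real GPU code, and all numbers are illustrative; it only contrasts a monolithic kernel that copies then computes each chunk in order against a pipelined schedule in which an asynchronous copy engine fetches the next chunk while the current one is being computed.

```python
# Toy timing model (not real GPU code): contrasts serialized copy+compute
# against an overlapped pipeline. All durations are illustrative.

def serial_time(chunks, copy_ms, compute_ms):
    """Monolithic model: every chunk is copied, then computed, in order."""
    return chunks * (copy_ms + compute_ms)

def overlapped_time(chunks, copy_ms, compute_ms):
    """Pipelined model: chunk i+1 is copied while chunk i is computed."""
    # The first copy hides nothing; after that, the slower of the two
    # units sets the pace for each remaining chunk.
    return copy_ms + (chunks - 1) * max(copy_ms, compute_ms) + compute_ms

if __name__ == "__main__":
    chunks, copy_ms, compute_ms = 8, 2.0, 3.0
    print(serial_time(chunks, copy_ms, compute_ms))      # 40.0
    print(overlapped_time(chunks, copy_ms, compute_ms))  # 26.0
```

With eight chunks at 2 ms copy and 3 ms compute, overlap cuts the modeled runtime from 40 ms to 26 ms, which is the kind of headroom the monolithic model leaves on the table.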


Section 03

Core Approach of VDCores: Virtual Decoupling Engine and Dynamic Scheduling

The core idea of VDCores is a virtual decoupling engine: workloads are decomposed into fine-grained micro-operations (with explicit dependencies and independent resource requirements), and the GPU is abstracted into virtual cores, each corresponding to a heterogeneous resource. At runtime, hardware-accelerated dependency tracking, a greedy scheduling strategy, and compiler optimizations automatically overlap memory operations with computation, balancing flexibility against scheduling overhead.
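The scheduling idea can be sketched as a greedy list scheduler over a micro-operation dependency graph. Everything here is an assumption for illustration: the names (`MicroOp`, `schedule`), the one-core-per-resource-type model, and the cost figures are hypothetical, not the actual VDCores API.

```python
# Hypothetical sketch of the VDCores scheduling idea: micro-operations with
# explicit dependencies are greedily dispatched to a free "virtual core" of
# the matching resource type. Names and costs are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class MicroOp:
    name: str
    core: str                              # virtual core type: "copy" or "compute"
    cost: float                            # modeled duration
    deps: list = field(default_factory=list)

def schedule(ops):
    """Greedy list scheduler: each op starts as soon as its dependencies
    have finished and its virtual core (one per type here) is free."""
    finish = {}        # op name -> finish time
    core_free = {}     # core type -> time that core becomes free
    pending = list(ops)

    def earliest_start(o):
        return max([finish[d] for d in o.deps] + [core_free.get(o.core, 0.0)])

    while pending:
        ready = [o for o in pending if all(d in finish for d in o.deps)]
        op = min(ready, key=earliest_start)        # greedy: earliest-startable op
        start = earliest_start(op)
        finish[op.name] = start + op.cost
        core_free[op.core] = finish[op.name]
        pending.remove(op)
    return finish

ops = [
    MicroOp("load_w0", "copy", 2.0),
    MicroOp("load_w1", "copy", 2.0),
    MicroOp("gemm0", "compute", 3.0, deps=["load_w0"]),
    MicroOp("gemm1", "compute", 3.0, deps=["load_w1"]),
]
times = schedule(ops)
# load_w1 is copied while gemm0 runs: makespan 8.0 instead of the serial 10.0
print(max(times.values()))
```

Because dependencies and resource types are explicit on each micro-operation, the overlap falls out of the scheduler automatically instead of being hand-coded into a kernel.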


Section 04

Performance and Programming Efficiency Improvements in LLM Inference Scenarios

Tests on three GPUs (GH200, H100, and RTX6000 Pro) show that VDCores increases decoding throughput by 24% on average, and by up to 77% in dynamic-input scenarios; the kernel code required for the same functionality shrinks by 90%, significantly lowering the barrier to GPU programming.


Section 05

Technical Challenges and Solutions

VDCores overcomes three major challenges:

1. Micro-operation granularity trade-off: an adaptive granularity strategy.
2. Memory footprint of dependency graphs: compression techniques and shared-memory caching.
3. Compatibility with the CUDA ecosystem: progressive migration, allowing traditional kernels to be embedded.
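The granularity trade-off in the first challenge can be made concrete with a crude cost model: smaller micro-operations expose more overlap, but each one adds fixed scheduling overhead. Both the model and the adaptive rule below are assumptions for illustration, not VDCores' actual strategy.

```python
# Hypothetical cost model for the micro-op granularity trade-off: finer
# splitting hides more latency behind other units but pays per-op
# scheduling overhead. Parameters are illustrative assumptions.

def modeled_runtime(total_work_ms, n_ops, overhead_us_per_op=5.0,
                    overlap_fraction=0.5):
    """Splitting work into n_ops micro-ops hides a fraction of the work
    behind other units (more ops -> more hiding), at a fixed cost per op."""
    hidden = overlap_fraction * total_work_ms * (1 - 1 / n_ops)
    overhead = n_ops * overhead_us_per_op / 1000.0   # convert us -> ms
    return total_work_ms - hidden + overhead

def pick_granularity(total_work_ms, candidates=(1, 2, 4, 8, 16, 32, 64)):
    """Adaptive choice: evaluate the model and keep the best op count."""
    return min(candidates, key=lambda n: modeled_runtime(total_work_ms, n))
```

Under this model a large workload (10 ms) is best split into many micro-ops, while a tiny one (0.1 ms) prefers a coarse split, since overhead would otherwise dominate; this is the shape of decision an adaptive granularity strategy has to make.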


Section 06

Impact on AI Infrastructure and Future Outlook

VDCores moves GPU programming from manual optimization toward automatic optimization, provides a unified abstraction for heterogeneous computing, and suits cloud-native dynamic scenarios. It has been open-sourced at https://github.com/vdcores/vdcores. Future work includes expanding hardware support, developing a dedicated compiler, designing more user-friendly programming interfaces, and integrating with ML frameworks.