# VDCores: A Resource-Decoupled Programming and Execution Model for Asynchronous GPUs

> This article introduces VDCores, a decoupled programming model designed for the asynchronous hardware of modern GPUs. By representing workloads as dependency-connected micro-operations and automatically scheduling memory operations to overlap with computation, it significantly improves LLM inference throughput while greatly reducing kernel programming complexity.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-04T22:17:44.000Z
- Last activity: 2026-05-06T03:50:24.620Z
- Heat: 110.5
- Keywords: GPU programming, asynchronous execution, resource decoupling, LLM inference optimization, micro-operations, virtual cores, computing architecture
- Page link: https://www.zingnex.cn/en/forum/thread/vdcores-gpu
- Canonical: https://www.zingnex.cn/forum/thread/vdcores-gpu
- Markdown source: floors_fallback

---

## Introduction

This article introduces VDCores, a decoupled programming model designed for the asynchronous hardware of modern GPUs. By representing workloads as graphs of dependency-connected micro-operations and automatically scheduling memory operations to overlap with computation, VDCores addresses the mismatch between the traditional monolithic-kernel programming model and heterogeneous GPU hardware, significantly improving LLM inference throughput while greatly reducing kernel programming complexity.

## Background: Mismatch Between Traditional GPU Programming Models and Hardware Architecture

Modern GPUs are equipped with a variety of asynchronous hardware units (e.g., copy engines and tensor cores), but the traditional CUDA programming model treats the monolithic kernel as its unit of work, baking in assumptions of synchronous execution and static scheduling. As a result, parallelism across units is hard to express, memory transfers are serialized with computation, and hardware resources sit idle.
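The cost of that serialization can be seen with a toy cost model (this model is illustrative only, not from the article): each layer needs a weight transfer on the copy engine and a matmul on the tensor cores, and the total time differs sharply depending on whether the two are allowed to overlap.

```python
# Toy cost model (illustrative, not from the article): per-layer weight
# transfers (copy engine) and matmuls (tensor cores). A monolithic kernel
# serializes the two; double buffering lets the next transfer run while
# the current compute is in flight.

def serial_time(transfers, computes):
    """Total time when every transfer blocks its compute (monolithic model)."""
    return sum(transfers) + sum(computes)

def overlapped_time(transfers, computes):
    """Total time when transfer i+1 overlaps compute i (double buffering)."""
    t = transfers[0]  # the first transfer cannot be hidden
    for i, c in enumerate(computes):
        nxt = transfers[i + 1] if i + 1 < len(transfers) else 0
        t += max(c, nxt)  # compute and the next transfer proceed in parallel
    return t

transfers = [4, 4, 4, 4]   # ms per layer weight copy
computes  = [5, 5, 5, 5]   # ms per layer matmul

print(serial_time(transfers, computes))      # 36
print(overlapped_time(transfers, computes))  # 24
```

With these (made-up) numbers, hiding the transfers behind compute cuts total time by a third, which is the kind of cross-unit parallelism the monolithic model makes hard to express.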

## Core Approach of VDCores: Virtual Decoupling Engine and Dynamic Scheduling

The core idea of VDCores is the virtual decoupling engine: it decomposes workloads into fine-grained micro-operations with explicit dependencies and independent resource requirements, and abstracts the GPU's heterogeneous resources as virtual cores. At runtime, hardware-accelerated dependency tracking, a greedy scheduling policy, and compiler optimizations automatically overlap memory operations with computation, balancing flexibility against scheduling overhead.
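The decomposition and scheduling described above can be sketched in a few lines. This is a minimal sketch of the idea, not the VDCores API: the `MicroOp` structure, the virtual-core names, and the scheduler are all this sketch's assumptions.

```python
# Sketch of the VDCores idea (names and structure are assumptions, not the
# actual API): micro-operations carry explicit dependencies and the virtual
# core they need; a greedy scheduler dispatches the ready op that can start
# earliest, so ops on different virtual cores overlap automatically.

from dataclasses import dataclass, field

@dataclass
class MicroOp:
    name: str
    vcore: str                       # heterogeneous resource, e.g. "copy", "tensor"
    duration: int                    # abstract time units
    deps: list = field(default_factory=list)

def greedy_schedule(ops):
    """Return {op name: start time}; one op per virtual core at a time."""
    done_at = {}                     # op name -> finish time
    free_at = {}                     # vcore -> time it becomes free
    pending = list(ops)
    schedule = {}
    while pending:
        best = None
        for op in pending:           # consider only ops whose deps are done
            if all(d in done_at for d in op.deps):
                ready = max([done_at[d] for d in op.deps], default=0)
                start = max(ready, free_at.get(op.vcore, 0))
                if best is None or start < best[0]:
                    best = (start, op)
        start, op = best
        schedule[op.name] = start
        done_at[op.name] = start + op.duration
        free_at[op.vcore] = start + op.duration
        pending.remove(op)
    return schedule

ops = [
    MicroOp("load0", "copy", 2),
    MicroOp("load1", "copy", 2, deps=["load0"]),   # next tile's copy
    MicroOp("mm0", "tensor", 3, deps=["load0"]),
    MicroOp("mm1", "tensor", 3, deps=["load1", "mm0"]),
]
print(greedy_schedule(ops))
# load1 and mm0 both start at t=2: the copy engine prefetches the next
# tile while the tensor cores work, with no explicit sync code.
```

The overlap falls out of the resource tags and the dependency edges; no stream or event management appears in the workload description, which is the programming-complexity reduction the article claims.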

## Performance and Programming Efficiency Improvements in LLM Inference Scenarios

Tests on three GPUs (GH200, H100, RTX6000 Pro) show that VDCores increases decoding throughput by 24% on average, and by up to 77% under dynamic input workloads; the kernel code required for equivalent functionality shrinks by 90%, significantly lowering the barrier to GPU programming.

## Technical Challenges and Solutions

VDCores overcomes three major challenges: (1) choosing micro-operation granularity, addressed with an adaptive granularity strategy; (2) the memory footprint of dependency graphs, addressed with compression techniques and shared-memory caching; (3) compatibility with the CUDA ecosystem, addressed through progressive migration that allows traditional kernels to be embedded.
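To make challenge (2) concrete, one common way to shrink a dependency graph is to exploit locality: a micro-op's predecessors are usually recent, so each op can store its dependencies as a fixed-width bitmask of backward offsets instead of a list of full IDs. The encoding below is this sketch's assumption, not VDCores' actual on-chip format.

```python
# Hypothetical dependency-graph compression (not VDCores' actual format):
# pack each op's predecessors, which must lie within a recent window, into
# a single 32-bit mask of backward offsets instead of a list of op IDs.

WINDOW = 32  # how far back a dependency may reach (bits in the mask)

def compress_deps(op_id, dep_ids):
    """Pack dependencies on ops in [op_id - WINDOW, op_id) into one mask."""
    mask = 0
    for d in dep_ids:
        offset = op_id - d
        assert 0 < offset <= WINDOW, "dependency outside window"
        mask |= 1 << (offset - 1)
    return mask

def decompress_deps(op_id, mask):
    """Recover the explicit dependency list from the packed mask."""
    return [op_id - (bit + 1) for bit in range(WINDOW) if mask >> bit & 1]

mask = compress_deps(100, [99, 97, 90])
print(bin(mask))                           # 0b1000000101
print(sorted(decompress_deps(100, mask)))  # [90, 97, 99]
```

A mask like this costs 4 bytes per op regardless of fan-in, and fits naturally in the shared-memory caching the article mentions; long-range dependencies outside the window would need an escape path to an explicit list.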

## Impact on AI Infrastructure and Future Outlook

VDCores moves GPU programming from manual optimization toward automatic optimization, provides a unified abstraction for heterogeneous computing, and adapts well to dynamic cloud-native workloads. It has been open-sourced (https://github.com/vdcores/vdcores). Planned work includes broader hardware support, a dedicated compiler, friendlier programming interfaces, and integration with ML frameworks.
