Tencent Open-sources hpc-ops: High-performance LLM Inference Operator Library with 2.22x Decoding Speed Improvement

Tencent Hunyuan AI Infrastructure Team has open-sourced hpc-ops, a high-performance LLM inference operator library deeply optimized for NVIDIA H20 GPUs. It achieves up to 2.22x acceleration in the decoding phase and has been validated in Tencent's large-scale production environment.

Tags: LLM Inference · CUDA Optimization · Operator Library · Tencent · H20 · FP8 Quantization · Hopper Architecture · Open Source
Published 2026-04-09 19:05 · Last activity 2026-04-09 19:16 · Estimated read: 7 min

Section 01

[Introduction] Tencent Open-sources hpc-ops: H20 GPU-optimized LLM Inference Operator Library with 2.22x Decoding Acceleration

Tencent Hunyuan AI Infrastructure Team has open-sourced hpc-ops, a high-performance LLM inference operator library deeply optimized for NVIDIA H20 GPUs. This library achieves up to 2.22x acceleration in the decoding phase, has been validated in Tencent's large-scale production environment, and aims to provide the community with high-performance operator implementations while lowering integration barriers.


Section 02

Background: LLM Inference Performance Bottlenecks and Optimization Needs

As Large Language Models (LLMs) grow in scale, inference performance has become a key bottleneck for AI application deployment: high-throughput, low-latency serving in production directly affects both user experience and cost. While mainstream frameworks such as vLLM and SGLang provide solid baseline performance, there is still room for deeper, hardware-specific optimization. Based on its production practice, Tencent found that targeted operator optimization can significantly improve efficiency, which motivated the development and open-sourcing of hpc-ops.


Section 03

Introduction to hpc-ops and Core Technical Features

hpc-ops is a high-performance LLM inference operator library developed by the Tencent Hunyuan team, deeply optimized for NVIDIA H20 GPUs, validated in Tencent's large-scale production environment, and now open-sourced. Its core goal is to provide industry-leading performance for key operators while remaining compatible with mainstream inference frameworks. Technical features include:

  • Production-level stability: validated under high-pressure production scenarios;
  • Easy integration: a simple API for seamless access from vLLM/SGLang;
  • Rich precision support: BF16, FP8, and multiple quantization schemes;
  • Modern CUDA tutorial value: clean, practical examples of CuTe/CUTLASS usage.


Section 04

Core Performance Metrics: Significant Acceleration for Multiple Operators

hpc-ops achieves significant acceleration across multiple key operators:

  • Attention operator (BF16): 1.33x in Prefill phase, 2.22x in Decode phase (compared to FlashInfer, FlashAttention 2/3, TensorRT-LLM);
  • Attention operator (FP8): 1.12x in Prefill phase, 2.0x in Decode phase (compared to FlashInfer, FlashAttention 3, TensorRT-LLM);
  • FusedMoE operator (FP8): 1.49x in Prefill phase, 1.14x in Decode phase (compared to TensorRT-LLM, vLLM);
  • GroupGEMM operator (FP8): 1.1x in Prefill phase, 1.88x in Decode phase (compared to DeepGEMM).

These improvements translate into lower latency, higher throughput, and better cost-effectiveness.
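As a rough illustration of what a per-operator speedup means end to end, an Amdahl's-law-style estimate combines the accelerated operator's share of a step with the unchanged remainder. The 60% attention share below is a made-up figure for illustration, not a measured hpc-ops number; only the 2.22x decode speedup comes from the article.

```python
def end_to_end_speedup(accel_fraction, operator_speedup):
    """Amdahl's-law estimate: only `accel_fraction` of the step's time
    is spent in the accelerated operator; the rest is unchanged."""
    new_time = (1 - accel_fraction) + accel_fraction / operator_speedup
    return 1 / new_time

# Hypothetical example: suppose attention takes 60% of a decode step
# and the attention kernel itself is 2.22x faster (BF16 decode figure).
speedup = end_to_end_speedup(accel_fraction=0.6, operator_speedup=2.22)
print(f"{speedup:.2f}x")  # ≈ 1.49x for the whole decode step
```

The larger the operator's share of total step time, the closer the end-to-end gain approaches the kernel-level speedup.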

Section 05

Supported Operator Types and Runtime Environment Requirements

Supported Operators:

  • Attention Prefill/Decode optimization: optimized kernels for both phases of the attention mechanism, with support for paged attention;
  • Quantized GroupGEMM: FP8 grouped matrix multiplication with block-level/tensor-level weight scaling;
  • Quantized FusedMoE: fused mixture-of-experts operator with FP8 expert weights and flexible scaling strategies.

Runtime Environment Requirements:

  • GPU architecture: NVIDIA SM90 (e.g., Hopper architecture like H20, H100);
  • Python: 3.8+;
  • Compiler: C++17 compatible;
  • CUDA toolkit: 12.8+.
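Paged attention, mentioned above, stores the KV cache in fixed-size pages addressed through a per-sequence block table instead of one contiguous buffer. A minimal pure-Python sketch of the logical-to-physical index mapping such kernels perform (the page size and block table here are illustrative, not hpc-ops internals):

```python
PAGE_SIZE = 16  # tokens per KV-cache page (illustrative value)

def kv_slot(block_table, token_idx, page_size=PAGE_SIZE):
    """Map a sequence-local token index to a physical cache slot.

    block_table[i] holds the physical page storing logical page i;
    pages need not be contiguous, which is the point of paging.
    """
    logical_page, offset = divmod(token_idx, page_size)
    physical_page = block_table[logical_page]
    return physical_page * page_size + offset

# A sequence whose 3 logical pages landed on scattered physical pages:
table = [7, 2, 9]
print(kv_slot(table, 0))   # token 0  -> page 7, offset 0  -> slot 112
print(kv_slot(table, 20))  # token 20 -> page 2, offset 4  -> slot 36
```

This indirection lets the cache allocator hand out pages on demand, avoiding the fragmentation of per-sequence contiguous buffers.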

Section 06

Technical Implementation Highlights: Deep Optimization Drives Performance Improvement

The performance improvement of hpc-ops comes from multiple optimization aspects:

  • Memory access optimization: Fine-grained memory layout and access pattern design to maximize GPU bandwidth utilization;
  • Computational parallelism improvement: Instruction-level optimization for Hopper architecture Tensor Cores to increase compute unit utilization;
  • Quantization-aware implementation: Deep integration of quantization logic at the operator level to avoid precision conversion overhead;
  • Fusion strategy: Fusing multiple small operators into a single kernel to reduce launch and intermediate result write-back overhead.
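The quantization-aware point can be illustrated with the block-level scaling used for FP8 weights: each block of values shares one scale chosen so the block's maximum magnitude fits the FP8 E4M3 range (max finite value 448), and dequantization multiplies the scale back. The sketch below is a simplified pure-Python model that approximates the E4M3 grid by rounding to a 4-bit significand; real E4M3 rounding, subnormals, and saturation are more involved, and none of these helper names come from hpc-ops.

```python
import math

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def to_e4m3_grid(x):
    """Crude stand-in for E4M3 rounding: snap to a 4-bit significand
    (1 implicit + 3 mantissa bits); ignores subnormals/saturation."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(abs(x))    # m in [0.5, 1)
    m_q = round(m * 16) / 16     # 4-bit significand grid
    return math.copysign(math.ldexp(m_q, e), x)

def quantize_block(block):
    """Choose one scale per block so its max magnitude maps to E4M3_MAX."""
    scale = max(abs(v) for v in block) / E4M3_MAX
    return [to_e4m3_grid(v / scale) for v in block], scale

def dequantize_block(q, scale):
    return [v * scale for v in q]

block = [0.013, -0.42, 0.0071, 0.3, -0.09]
q, s = quantize_block(block)
restored = dequantize_block(q, s)
max_rel_err = max(abs(a - b) / abs(a) for a, b in zip(block, restored))
print(f"max relative error: {max_rel_err:.3f}")
```

Keeping the scale per block (rather than per tensor) bounds the relative error by the local dynamic range, which is why block-level scaling tolerates outliers better than a single tensor-wide scale.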

Section 07

Future Roadmap and Community Participation

Future Roadmap:

  • Sparse attention operators: Optimize sparse attention kernels for long-context LLMs;
  • Extended quantization support: Develop 4bit/8bit mixed-precision strategies;
  • Compute-communication fusion: overlap computation and inter-GPU communication to reduce distributed inference overhead.

Open-source Significance and Community Participation:

  • Provide production-validated high-performance operators to help the community improve inference efficiency;
  • CuTe/CUTLASS examples can serve as learning resources for modern CUDA;
  • Community contributions (bug fixes, scenario-specific optimizations, etc.) are welcome, and the project uses a friendly open-source license.

The code is available from the GitHub repository, where issues and pull requests can be submitted.
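The compute-communication fusion item on the roadmap can be sketched with a simple timing model: without overlap, a distributed layer step costs compute time plus communication time, while a perfectly overlapped step costs only the larger of the two. The per-layer timings below are hypothetical illustration values, not hpc-ops measurements.

```python
def step_time(compute_ms, comm_ms, overlap=False):
    """Per-layer step time under a simple model: overlapping hides
    the shorter of compute and communication behind the longer."""
    if overlap:
        return max(compute_ms, comm_ms)
    return compute_ms + comm_ms

# Hypothetical per-layer timings for a tensor-parallel decode step:
compute, comm = 0.80, 0.35  # milliseconds
serial = step_time(compute, comm)
fused = step_time(compute, comm, overlap=True)
print(f"serial {serial:.2f} ms -> overlapped {fused:.2f} ms "
      f"({serial / fused:.2f}x)")
```

The model also shows the technique's limit: once communication is fully hidden, further gains must come from shrinking compute itself.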