Section 01
FlashRT-HF-kernels: High-performance CUDA/CUTLASS Inference Kernels for Hugging Face
FlashRT-HF-kernels is an open-source project by LiangSu8899, hosted on GitHub, that provides standalone CUDA/CUTLASS kernels optimized for low-latency inference at small batch sizes (1 to 8). It targets large language models (LLMs), vision-language-action (VLA) models, and physical AI workloads, with the goal of making low-latency inference kernels available to the Hugging Face community. This post covers the project's background, technical details, performance, and applications.