Zing Forum

LoKA: A Systematic Framework for Enabling FP8 Low-Precision Computing in Large-Scale Recommendation Models

LoKA addresses the numerical sensitivity and communication bottlenecks faced by recommendation models in FP8 low-precision computing through system-model co-design, achieving a balance between training efficiency and model quality.

Tags: FP8 · Recommendation Models · Low-Precision Computing · Numerical Stability · GPU Optimization · Model Training · System-Model Co-Design · Matrix Operations
Published 2026-05-12 01:32 · Recent activity 2026-05-12 14:18 · Estimated read 7 min

Section 01

Introduction: LoKA Framework—A Systematic Solution for Enabling FP8 Low-Precision Computing in Large-Scale Recommendation Models

LoKA addresses the numerical sensitivity and communication bottlenecks that Large-Scale Recommendation Models (LRMs) face in FP8 low-precision computing through system-model co-design, balancing training efficiency against model quality. The framework rests on three core principles: precise profiling based on real distributions, co-design of model components and hardware, and intelligent orchestration across kernel libraries. Together they provide a systematic methodology for bringing FP8 to recommendation systems.

Section 02

Background: Differences in FP8 Implementation Between LLMs and LRMs

As a headline feature of the latest GPU architectures, FP8 has already succeeded in Large Language Models (LLMs), multiplying the peak throughput of matrix operations. In Large-Scale Recommendation Models (LRMs), however, FP8 faces obstacles: LRMs consist of many small matrix multiplications, each GEMM is followed by a normalization layer that is sensitive to numerical precision, and training relies on cross-device communication that runs into bandwidth bottlenecks. Naively applying FP8 can degrade model quality, lengthen training, or even cause divergence, which calls for a new system-model co-design approach.
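To see where this sensitivity comes from, consider the per-tensor scaling recipe FP8 GEMMs typically rely on: map each tensor's absolute maximum onto the FP8 range before casting. The sketch below (pure Python, simulating the E4M3 format; the helper names are illustrative, not from the paper) shows how a single outlier in an activation tensor degrades the precision of every other value:

```python
import math

E4M3_MAX = 448.0  # largest finite value in FP8 E4M3

def fake_quant_e4m3(x: float) -> float:
    """Round x to the nearest representable E4M3 value (1 sign, 4 exponent, 3 mantissa bits)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    exp = max(math.floor(math.log2(a)), -6)  # clamp to the min normal exponent; covers subnormals
    step = 2.0 ** (exp - 3)                  # 3 mantissa bits -> 8 steps per binade
    return sign * round(a / step) * step

def quantize_tensor(values):
    """Per-tensor scaling: map amax onto the E4M3 range, quantize, then dequantize."""
    amax = max(abs(v) for v in values) or 1.0
    scale = E4M3_MAX / amax
    return [fake_quant_e4m3(v * scale) / scale for v in values]

# A small activation vector with one outlier: the outlier eats the dynamic range,
# so the small values land on coarse quantization steps.
acts = [0.01, -0.02, 0.015, 300.0]
deq = quantize_tensor(acts)
rel_err = [abs(a - b) / abs(a) for a, b in zip(acts, deq)]
```

The outlier itself is reproduced almost exactly, while the small activations pick up several percent of relative error, which is precisely the kind of sensitivity the following sections set out to manage.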

Section 03

Three Core Principles of the LoKA Framework

LoKA (Low-precision Kernel Applications) proposes three core principles to guide FP8 deployment in LRMs:

  1. Precise profiling based on real distributions: Continuously learn the distribution characteristics of activation values and weights in the online training environment, quantify the numerical error of each layer, and identify layers where FP8 can be safely used;
  2. Co-design of model components and hardware: Provide reusable model adaptation techniques, modify the architecture to enhance numerical stability and improve FP8 execution efficiency;
  3. Intelligent orchestration across kernel libraries: Use statistical insights to select the fastest kernel that meets precision requirements for each operation, and dynamically schedule to maximize computational throughput.

Section 04

LoKA Probe: A Statistics-Driven Error Quantification Tool

LoKA Probe is the foundation of the framework. It benchmarks online, collecting statistics of activations and weights during real training and capturing distribution drift across training dynamics. Its core output is a per-layer error quantification: by comparing the numerical differences between FP8 and high-precision formats (FP32/BF16), it classifies layers as 'safe' or 'unsafe' for FP8, providing the basis for the subsequent optimizations.
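A minimal sketch of the Probe idea, assuming an EMA-style running amax to follow distribution drift and an externally measured per-layer FP8 error (the class, function names, and 1% tolerance are assumptions, not the paper's implementation):

```python
class ProbeStat:
    """Running amax tracker with an exponential moving average to follow distribution drift."""

    def __init__(self, decay: float = 0.99):
        self.decay = decay
        self.amax = 0.0

    def update(self, values) -> float:
        """Fold one batch's absolute maximum into the running estimate."""
        batch_amax = max(abs(v) for v in values)
        if self.amax == 0.0:
            self.amax = batch_amax  # first observation seeds the estimate
        else:
            self.amax = self.decay * self.amax + (1 - self.decay) * batch_amax
        return self.amax

def classify(error_by_layer, tol: float = 0.01):
    """Partition layers into 'safe'/'unsafe' for FP8 given measured relative errors."""
    return {name: ("safe" if err <= tol else "unsafe")
            for name, err in error_by_layer.items()}

probe = ProbeStat(decay=0.9)
for _ in range(3):
    probe.update([0.5, -2.0, 1.0])  # stable distribution -> stable amax

report = classify({"mlp.dense0": 0.002, "layernorm.in": 0.08})
```

The 'unsafe' label on a layer is what routes it either to a Mods-style architectural fix or to a higher-precision kernel.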

Section 05

LoKA Mods: Reusable Model Adaptation Techniques

For the numerically sensitive areas identified by Probe, LoKA Mods provides model-level improvement solutions: such as adjusting the calculation order of normalization layers, introducing precision protection mechanisms for residual connections, and protecting the precision of intermediate results. These techniques can be seamlessly integrated into existing model architectures without large-scale reconstruction, enhancing numerical stability while improving FP8 execution efficiency.
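One Mods-style pattern described above, running the compute branch in FP8 while keeping the residual add in high precision, can be sketched as follows (pure Python; `fp8_sim` is a crude E4M3 stand-in and the block structure is an assumption, not the paper's code):

```python
import math

def fp8_sim(x: float) -> float:
    """Crude E4M3 stand-in: clamp to +/-448 and keep 3 mantissa bits."""
    if x == 0.0:
        return 0.0
    s = -1.0 if x < 0 else 1.0
    a = min(abs(x), 448.0)
    step = 2.0 ** (max(math.floor(math.log2(a)), -6) - 3)
    return s * round(a / step) * step

def mlp_branch(xs, w):
    """Toy 'GEMM' branch: elementwise multiply, executed on FP8-cast operands."""
    return [fp8_sim(fp8_sim(x) * fp8_sim(w)) for x in xs]

def block_fp8_residual(xs, w):
    """Anti-pattern: the residual add itself is also forced through the FP8 cast."""
    return [fp8_sim(x + y) for x, y in zip(xs, mlp_branch(xs, w))]

def block_protected_residual(xs, w):
    """Mods pattern: branch in FP8, residual add kept in high precision."""
    return [x + y for x, y in zip(xs, mlp_branch(xs, w))]
```

In this toy setting, a small branch update (0.125 on top of an activation of 64.0) rounds away entirely when the residual add happens at FP8 granularity, but survives when the residual stream stays in high precision; stacked over many blocks, that is the difference between learning and stalling.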

Section 06

LoKA Dispatch: An Intelligent Kernel Selection Runtime System

LoKA Dispatch selects the optimal kernel dynamically at execution time. It maintains a kernel performance database recording the throughput and precision characteristics of each kernel; combined with the layer-level precision requirements from Probe, it picks the fastest kernel that meets them. It also weighs pipeline efficiency across the computation graph, preferring kernel combinations that can be fused in order to cut memory round-trip overhead.
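A toy version of such a selection policy, with a hypothetical kernel database (the kernel names, throughput figures, and error numbers below are made up for illustration):

```python
# Hypothetical kernel database: throughput, precision, and fusability per kernel.
KERNELS = [
    {"name": "fp8_fused_gemm_norm", "tput_tflops": 1600, "rel_err": 0.008, "fusable": True},
    {"name": "fp8_gemm",            "tput_tflops": 1500, "rel_err": 0.008, "fusable": False},
    {"name": "bf16_gemm",           "tput_tflops": 800,  "rel_err": 0.001, "fusable": False},
    {"name": "fp32_gemm",           "tput_tflops": 400,  "rel_err": 1e-7,  "fusable": False},
]

def dispatch(max_rel_err: float) -> str:
    """Pick the fastest kernel within the layer's precision budget, preferring fusable ones."""
    eligible = [k for k in KERNELS if k["rel_err"] <= max_rel_err]
    best = max(eligible, key=lambda k: (k["fusable"], k["tput_tflops"]))
    return best["name"]
```

A loose precision budget (say 1% relative error, as Probe might grant a 'safe' layer) selects the fused FP8 kernel, while a tight budget on an 'unsafe' layer falls back to BF16.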

Section 07

Practical Significance and Industry Impact of LoKA

LoKA demonstrates that FP8 acceleration is achievable without sacrificing model quality, and is expected to cut recommendation-model training costs by 30-50%. It offers a technical reference for companies running large-scale recommendation systems, such as Meta, Google, and ByteDance, bridging GPU low-precision hardware capabilities and practical applications.

Section 08

Future Outlook: Generality of the LoKA Framework and Expansion of Low-Precision Technologies

LoKA's three-principle methodology applies to other numerically sensitive AI workloads as well, such as scientific computing and financial modeling. As hardware support for even lower-precision formats like FP4 and INT4 matures, its statistics-driven method will inform their adoption too. Future AI systems are likely to adopt mixed-precision strategies, and LoKA is a pioneering exploration of that trend.