Section 01
[Introduction] Neural Network Quantization Research: Precision and Performance Trade-offs from FP32 to INT8/INT4
This 50-day project studies neural network quantization from FP32 to INT8 and INT4. It implements custom CUDA kernels, compares them against NVIDIA TensorRT, and examines the trade-off between precision loss and performance gain, with the aim of giving TensorRT and inference teams a practical reference. The study covers real models such as ResNet-18 and DistilBERT and includes systematic benchmarking and layer-wise sensitivity analysis.
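To make the precision side of that trade-off concrete, the following is a minimal sketch of symmetric per-tensor quantization in plain Python. It is an illustration only: the project's actual scheme (symmetric or asymmetric, per-tensor or per-channel, calibration method) is not stated here, and the function names `quantize_symmetric`, `dequantize`, and `max_abs_error`, as well as the sample weights, are hypothetical.

```python
def quantize_symmetric(values, num_bits):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale.

    Assumption: symmetric, per-tensor quantization. This is one common
    scheme, not necessarily the one used in the project.
    """
    qmax = 2 ** (num_bits - 1) - 1            # 127 for INT8, 7 for INT4
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step, then clamp to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [qi * scale for qi in q]


def max_abs_error(values, num_bits):
    """Largest round-trip error after quantizing and dequantizing."""
    q, scale = quantize_symmetric(values, num_bits)
    restored = dequantize(q, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


# Made-up FP32-style weights, for illustration only.
weights = [0.82, -1.27, 0.031, 0.5, -0.44, 1.0]

err8 = max_abs_error(weights, 8)
err4 = max_abs_error(weights, 4)
print(f"INT8 max abs round-trip error: {err8:.5f}")
print(f"INT4 max abs round-trip error: {err4:.5f}")
```

Under this scheme the round-trip error of each value is bounded by half the scale, and the scale grows as the bit width shrinks, so INT4 loses noticeably more precision than INT8 on the same tensor. That loss is what has to be weighed against the speed and memory gains of narrower integer types.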