LLM inference is bottlenecked by matrix-vector multiplication, the most frequent operation in the Transformer architecture. NanoLlama targets the AVX2 instruction set on Intel/AMD processors with hand-tuned vectorized kernels.
The key techniques are:
256-bit register parallelism: Using __m256 SIMD registers, each instruction operates on 8 single-precision floats (float32) at once. Half-precision (float16) values can be packed 16 per register for storage, but AVX2 has no native float16 arithmetic, so they are converted to float32 (typically via the F16C instructions) before computing. Compared to scalar code, the theoretical per-instruction speedup is up to 8x, with float16 storage adding further memory-bandwidth savings.
FMA fused multiply-add: Modern CPUs support Fused Multiply-Add, which computes a * b + c in a single instruction. NanoLlama uses this to fuse the multiply and accumulate steps of each dot product, roughly halving the instruction count of the inner loop compared with a separate multiply followed by an add.
OpenMP multi-threading parallelism: OpenMP compiler directives automatically distribute the work across all available CPU cores. Each core independently processes a disjoint slice of the tensor (e.g. a block of output rows), giving near-linear multi-core scaling until memory bandwidth becomes the limiting factor.