The performance data shows some surprising results. The f16 version of fuchat is about twice as slow as the f32 version, which is counterintuitive, since half-precision computation is usually faster. The developers speculate that this is related to how well the Futhark compiler optimizes the f16 type, or to changes in GPU memory access patterns.
More noteworthy is the gain from KV caching. Before KV caching was implemented, the pure f32 version ran inference at only 2-5 tokens/s. After Futhark's in-place update mechanism was introduced, performance improved by 5 to 10 times. This demonstrates how effective uniqueness types can be in a functional language when handling state-intensive computation.
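To make the source of that speedup concrete, here is a minimal Python sketch that counts how many key/value projections a decoder performs with and without a cache. It is not fuchat's code (fuchat is written in Futhark), and the names `project_kv`, `decode_without_cache`, and `decode_with_cache` are hypothetical. The projection is only a stand-in for the real per-token matrix multiplications. The preallocated buffer written at index `pos` plays the role that an in-place update plays in the Futhark version.

```python
from typing import List, Optional, Tuple

KV = Tuple[float, float]


def project_kv(token: int) -> KV:
    """Stand-in for the key/value projection of one token (hypothetical)."""
    return (token * 0.5, token * 2.0)


def decode_without_cache(tokens: List[int]) -> int:
    """Recompute K/V for the whole prefix at every step; return the projection count."""
    projections = 0
    for step in range(1, len(tokens) + 1):
        kv = [project_kv(t) for t in tokens[:step]]  # entire prefix, every step
        projections += len(kv)
    return projections


def decode_with_cache(tokens: List[int]) -> Tuple[int, List[Optional[KV]]]:
    """Project only the newest token and write it into a preallocated buffer."""
    cache: List[Optional[KV]] = [None] * len(tokens)  # allocated once
    projections = 0
    for pos, token in enumerate(tokens):
        cache[pos] = project_kv(token)  # overwrite one slot; the buffer is not copied
        projections += 1
    return projections, cache


tokens = list(range(8))
print(decode_without_cache(tokens))  # projections grow quadratically with length
count, cache = decode_with_cache(tokens)
print(count)                         # one projection per token
```

Without a cache the projection count grows quadratically with sequence length, and with one it grows linearly. The measured 5 to 10 times improvement depends on more than this count, so the sketch illustrates only the direction of the effect. In a purely functional setting, writing one slot of an array would conceptually produce a new array. Futhark's uniqueness types let the compiler confirm that the old array is never used again, so the write can happen in place.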
For comparison, on the same hardware, llama.cpp reaches about 150 tokens/s with the f16 quantized model and about 110 tokens/s with the f32 quantized model. That leaves fuchat with a significant gap to close, but for a single-file, type-safe, pure Futhark implementation, 25 tokens/s is already an impressive starting point.