The project author conducted tests in an environment with Intel Core i9-10980HK + 64GB RAM + RTX 2080 Super (8GB VRAM):
Small model (TinyLlama-1.1B):
- Full preloading load time: 7.91s, time to generate 20 tokens: 55.10s
- Disk mode load time: 1.74s, generation time: 54.58s
- Generation time is nearly identical because the model is small; only load time differs noticeably
Medium model (Qwen2.5-Coder-32B):
- Full preloading time to generate 20 tokens: 361.67s
- Disk mode time: 391.50s
- Preloading mode is about 8% faster, thanks to DMA transfer optimization
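The "about 8%" figure can be checked directly from the two reported timings. The short calculation below is only a sanity check on the numbers above; the variable names are illustrative.

```python
# Timings reported above for generating 20 tokens with Qwen2.5-Coder-32B.
preload_s = 361.67
disk_s = 391.50

# Fraction of wall-clock time saved relative to disk mode.
time_saved = (disk_s - preload_s) / disk_s

# Speed ratio: how many times faster preloading finishes the same work.
speedup = disk_s / preload_s

print(f"time saved: {time_saved:.1%}")
print(f"speedup: {speedup:.3f}x")
```

Measured as time saved, the gain is a little under 8%; measured as a speed ratio, it is a little over 8%. Both round to the figure quoted above.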
Large model (65GB fp16):
- Exceeds 64GB RAM, cannot use full preloading
- Sliding window mode (window sizes of 5, 10, and 34 layers) performs close to disk mode
- This confirms that background preloading can effectively hide disk latency
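To illustrate why background preloading can hide disk latency, here is a minimal sketch of a sliding-window prefetcher. It is not Chiquito's actual implementation: `load_layer`, `compute_layer`, `run_with_prefetch`, and `run_sequential` are hypothetical names, and disk reads and GPU compute are simulated with `time.sleep`.

```python
import queue
import threading
import time


def load_layer(index):
    """Hypothetical stand-in for reading one layer's weights from disk."""
    time.sleep(0.01)  # simulated disk latency
    return f"weights-{index}"


def compute_layer(weights, x):
    """Hypothetical stand-in for running one layer's forward pass."""
    time.sleep(0.01)  # simulated compute time
    return x + 1


def run_with_prefetch(num_layers, window, x=0):
    # A bounded queue acts as the sliding window: the loader thread blocks
    # once `window` layers are waiting, so memory use stays roughly capped.
    buffer = queue.Queue(maxsize=window)

    def loader():
        for i in range(num_layers):
            buffer.put(load_layer(i))

    threading.Thread(target=loader, daemon=True).start()

    # The main thread computes while the loader fetches upcoming layers.
    for _ in range(num_layers):
        x = compute_layer(buffer.get(), x)
    return x


def run_sequential(num_layers, x=0):
    # Baseline: each layer is loaded and then computed, with no overlap.
    for i in range(num_layers):
        x = compute_layer(load_layer(i), x)
    return x
```

In this toy setup, `run_with_prefetch(34, 5)` overlaps loading with compute, whereas `run_sequential(34)` pays both costs back to back. When loading is no slower than compute, most of the disk time disappears from the critical path, which is consistent with the sliding-window results above.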
In addition, Chiquito supports 4-bit and 8-bit quantization via bitsandbytes, which can compress a 32B model from 65GB to about 16GB (4-bit), further lowering the memory threshold.
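The compression figures follow from simple arithmetic on bits per parameter. The sketch below assumes the parameter count implied by the 65GB fp16 figure (65e9 bytes at 2 bytes per parameter) and ignores quantization metadata; `weight_size_gb` is an illustrative helper, not part of Chiquito or bitsandbytes.

```python
def weight_size_gb(num_params, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: parameter count inferred from the 65GB fp16 size quoted above.
num_params = 65e9 / 2

for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_size_gb(num_params, bits):.2f} GB")
```

Under these assumptions, 4-bit storage comes to roughly a quarter of the fp16 size, in line with the "about 16GB" figure. Real quantized checkpoints are somewhat larger because scaling constants and any unquantized layers add overhead.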