Zing Forum

Chiquito: Run Large Models on Consumer GPUs with RAM Preloading

Chiquito enables smooth operation of large language models (LLMs) on devices with limited VRAM through layer-wise inference and RAM preloading techniques. Compared to reading layers from disk one by one, RAM preloading can increase inference speed by 2-5 times.

LLM inference optimization · VRAM optimization · RAM preloading · edge computing · HuggingFace · quantization · consumer hardware
Published 2026-04-06 04:40 · Recent activity 2026-04-06 04:51 · Estimated read 6 min

Section 01

Introduction / Opening Post: Chiquito: Run Large Models on Consumer GPUs with RAM Preloading

Chiquito enables smooth operation of large language models (LLMs) on devices with limited VRAM through layer-wise inference and RAM preloading techniques. Compared to reading layers from disk one by one, RAM preloading can increase inference speed by 2-5 times.

Section 02

Background: VRAM Bottlenecks Plague Local LLM Deployment

As the parameter counts of large language models continue to grow, consumer GPUs (e.g., an RTX 2080 with 8GB VRAM) cannot load a complete LLM directly. Even a 7B-parameter model requires about 14GB of VRAM in fp16 precision (2 bytes per parameter), far exceeding the capacity of ordinary gaming GPUs.

Traditional solutions either rely on cloud APIs (sacrificing privacy and autonomy) or use quantization techniques (which may lose precision). The Chiquito project offers a different path: through layer-wise inference and system RAM preloading, it enables large models to run on consumer hardware while maintaining precision.

Section 03

Project Overview: What is Chiquito?

Chiquito is a lightweight reimplementation inspired by AirLLM, designed specifically for machines with limited VRAM but sufficient RAM. Its core ideas are simple:

  1. Layer-wise inference: Load only one model layer onto the GPU at a time, release it immediately after forward propagation
  2. RAM preloading: Preload all layer weights into system RAM (instead of reading from disk each time)
  3. Sliding window: For extra-large models, use a sliding-window mode that keeps only N layers resident in RAM at any time, with background threads asynchronously preloading subsequent layers

This design makes PCIe transfer (RAM → GPU) the bottleneck instead of disk I/O, and the former is 2-5 times faster than the latter.
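The layer-wise loop can be sketched as follows. This is a minimal illustration, not Chiquito's actual code: `build_layer` is a hypothetical factory that constructs an empty layer module, and the per-layer state dicts are assumed to already sit in system RAM.

```python
import torch

def layerwise_forward(layer_state_dicts, hidden, build_layer, device="cuda"):
    """Run a forward pass one layer at a time, so only a single layer
    ever occupies VRAM; weights come from system RAM, not disk."""
    for state_dict in layer_state_dicts:
        layer = build_layer()              # empty layer skeleton on CPU
        layer.load_state_dict(state_dict)  # weights are already in RAM
        layer.to(device)                   # RAM -> GPU copy over PCIe
        with torch.no_grad():
            hidden = layer(hidden.to(device))
        del layer                          # free VRAM before the next layer
        if device == "cuda":
            torch.cuda.empty_cache()
    return hidden
```

Because each iteration pays only the PCIe copy, total latency is dominated by RAM-to-GPU bandwidth rather than disk reads.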

Section 04

Core Mechanism: Three Operation Modes

Chiquito provides flexible configuration options to adapt to different hardware conditions:

Section 05

Mode 1: Full Preloading (preload_to_ram=True)

Suitable for scenarios where the model can fit entirely into system RAM. During initialization, the entire model is split into separate .safetensors files per layer and loaded into RAM. During inference, data is copied directly from RAM to GPU, making this the fastest mode.
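The split-then-preload step might look like the sketch below. The `split_and_preload` helper is hypothetical, and `torch.save`/`torch.load` stand in for the per-layer `.safetensors` files the project actually writes, to keep the sketch dependency-free.

```python
import os
import torch

def split_and_preload(layers, out_dir):
    """Write each layer's weights to its own file, then load them all
    back into system RAM (sketch of preload_to_ram=True)."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, layer in enumerate(layers):
        path = os.path.join(out_dir, f"layer_{i:03d}.pt")
        torch.save(layer.state_dict(), path)  # one file per layer
        paths.append(path)
    # Full preload: every layer's weights now sit in RAM, so inference
    # only ever pays the RAM -> GPU copy, never disk I/O.
    return [torch.load(p) for p in paths]
```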

Section 06

Mode 2: Sliding Window (preload_to_ram=N)

Suitable for scenarios where the model exceeds available RAM. Only N layers are kept in memory, and background threads continuously preload upcoming layers. As long as disk I/O can keep up with GPU computing speed, there will be no pauses.
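The sliding window can be approximated with a bounded queue: the producer thread blocks whenever N layers are already buffered, which gives exactly the "keep at most N layers in RAM" behavior. A sketch, assuming per-layer weight files on disk; not Chiquito's actual implementation.

```python
import queue
import threading
import torch

def sliding_window_loader(layer_paths, window=5):
    """Yield layer state_dicts while a background thread preloads up to
    `window` layers ahead of the consumer (sketch of preload_to_ram=N)."""
    buf = queue.Queue(maxsize=window)

    def producer():
        for path in layer_paths:
            buf.put(torch.load(path))  # blocks while the window is full
        buf.put(None)                  # sentinel: all layers loaded

    threading.Thread(target=producer, daemon=True).start()
    while (state_dict := buf.get()) is not None:
        yield state_dict
```

As long as the producer refills the queue faster than the GPU consumes layers, the disk latency stays hidden.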

Section 07

Mode 3: Disk Fallback (preload_to_ram=False)

Minimum memory usage mode; layer weights are read from disk each time. This is the slowest mode but can run in extremely low-memory environments.
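Seen side by side, the three modes differ only in the value of `preload_to_ram` (the parameter name comes from the article; the loader call here is hypothetical):

```python
# Hypothetical loader call; only the preload_to_ram values are from the article.
model = chiquito_load("Qwen/Qwen2.5-Coder-32B", preload_to_ram=True)   # Mode 1: full RAM preload
model = chiquito_load("Qwen/Qwen2.5-Coder-32B", preload_to_ram=10)     # Mode 2: 10-layer sliding window
model = chiquito_load("Qwen/Qwen2.5-Coder-32B", preload_to_ram=False)  # Mode 3: read each layer from disk
```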

Section 08

Performance Test: Let the Data Speak

The project author conducted tests in an environment with Intel Core i9-10980HK + 64GB RAM + RTX 2080 Super (8GB VRAM):

Small model (TinyLlama-1.1B):

  • Full preloading: load time 7.91s, time to generate 20 tokens 55.10s
  • Disk mode: load time 1.74s, generation time 54.58s
  • The difference is negligible because the model is so small

Medium model (Qwen2.5-Coder-32B):

  • Full preloading: time to generate 20 tokens 361.67s
  • Disk mode: 391.50s
  • Preloading mode is about 8% faster, thanks to DMA transfer optimization
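The 8% figure follows directly from the two timings (a quick arithmetic check, not part of the original benchmark script):

```python
preload_s, disk_s = 361.67, 391.50  # 20-token generation times from the test above
speedup = disk_s / preload_s - 1    # relative advantage of preloading
print(f"preloading is {speedup:.1%} faster")  # ~8.2%
```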

Large model (65GB fp16):

  • Exceeds 64GB RAM, cannot use full preloading
  • Sliding window mode (5/10/34 layers) has performance close to disk mode
  • Verifies that background preloading can effectively hide disk latency

In addition, Chiquito supports 4-bit/8-bit quantization via bitsandbytes, which can compress a 32B model from 65GB to about 16GB (4-bit), further lowering the memory threshold.
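The memory figures are easy to reproduce: fp16 costs 2 bytes per parameter, 4-bit half a byte. The ~32.8B parameter count used below is an assumption chosen to match the article's ~65GB figure, not a number from the source.

```python
def weight_gb(params_billion, bits_per_param):
    """Approximate weight footprint in decimal GB: params * bits / 8."""
    return params_billion * bits_per_param / 8

fp16_gb = weight_gb(32.8, 16)  # ~65.6 GB, the "65GB fp16" figure above
int4_gb = weight_gb(32.8, 4)   # ~16.4 GB after 4-bit quantization
tiny_gb = weight_gb(7, 16)     # 14 GB, the 7B fp16 figure from the background section
```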