
Chiquito: Run Large Models on Consumer GPUs with RAM Preloading

Chiquito enables smooth operation of large language models (LLMs) on devices with limited VRAM through layer-wise inference and RAM preloading techniques. Compared to reading layers from disk one by one, RAM preloading can increase inference speed by 2-5 times.

LLM inference optimization, VRAM optimization, RAM preloading, edge computing, HuggingFace, quantization, consumer-grade hardware

Published 2026-04-06 04:40 | Recent activity 2026-04-06 04:51 | Estimated read 6 min

Section 01

Introduction: Chiquito: Run Large Models on Consumer GPUs with RAM Preloading

Section 02

Background: VRAM Bottlenecks Plague Local LLM Deployment

As the parameter counts of large language models keep growing, consumer GPUs (e.g., an RTX 2080 with 8GB VRAM) can no longer hold a complete LLM. Even a 7B-parameter model needs about 14GB of VRAM for its weights in fp16 precision, far more than an ordinary gaming GPU provides.
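The 14GB figure follows from simple arithmetic: fp16 stores each parameter in 2 bytes. A minimal sketch, assuming a weights-only estimate that ignores activations and the KV cache (the function name is illustrative, not part of Chiquito):

```python
def weight_footprint_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


fp16_7b = weight_footprint_gb(7e9)  # 7B parameters at 2 bytes each (fp16)
vram_gb = 8.0                       # e.g. an 8GB consumer GPU
print(f"7B fp16 weights: {fp16_7b:.0f} GB, fits in {vram_gb:.0f} GB VRAM: {fp16_7b <= vram_gb}")
```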

Traditional solutions either rely on cloud APIs (sacrificing privacy and autonomy) or use quantization techniques (which may lose precision). The Chiquito project offers a different path: through layer-wise inference and system RAM preloading, it enables large models to run on consumer hardware while maintaining precision.

Section 03

Project Overview: What is Chiquito?

Chiquito is a lightweight reimplementation inspired by AirLLM, designed specifically for machines with limited VRAM but sufficient RAM. Its core ideas are simple:

Layer-wise inference: Load only one model layer onto the GPU at a time, and release it as soon as its forward pass finishes
RAM preloading: Preload all layer weights into system RAM (instead of reading them from disk on every pass)
Sliding window: For models too large for RAM, keep only N layers in memory at a time, with background threads preloading the upcoming layers asynchronously

This design moves the bottleneck from disk I/O to the PCIe transfer (RAM → GPU), which is 2-5 times faster.
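The layer-wise loop can be illustrated with a toy simulation. This is a sketch of the idea only, not Chiquito's code: each "layer" is a scale and a bias held in a Python list standing in for the RAM cache, and copying a dict stands in for the RAM → GPU transfer. All names here are hypothetical.

```python
from typing import Dict, List

# Toy stand-ins: each "layer" is a scale and a bias held in host RAM.
# In a real setup these would be per-layer weight tensors.
ram_cache: List[Dict[str, float]] = [
    {"scale": 2.0, "bias": 1.0},
    {"scale": 0.5, "bias": 0.0},
    {"scale": 3.0, "bias": -1.0},
]


def run_layerwise(x: float, layers: List[Dict[str, float]]) -> float:
    """Apply layers one at a time, keeping at most one layer 'on device'."""
    for host_weights in layers:
        device_weights = dict(host_weights)  # stands in for the RAM -> GPU copy
        x = device_weights["scale"] * x + device_weights["bias"]  # forward pass
        del device_weights  # release before the next layer is loaded
    return x


print(run_layerwise(1.0, ram_cache))
```

Only the activation `x` survives from one layer to the next, which is why the device never needs to hold more than one layer's weights.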

Section 04

Core Mechanism: Three Operation Modes

Chiquito provides flexible configuration options to adapt to different hardware conditions:

Section 05

Mode 1: Full Preloading (preload_to_ram=True)

Suitable for scenarios where the model can fit entirely into system RAM. During initialization, the entire model is split into separate .safetensors files per layer and loaded into RAM. During inference, data is copied directly from RAM to GPU, making this the fastest mode.

Section 06

Mode 2: Sliding Window (preload_to_ram=N)

Suitable for scenarios where the model exceeds available RAM. Only N layers are kept in memory, and background threads continuously preload upcoming layers. As long as disk I/O can keep up with GPU computing speed, there will be no pauses.
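One common way to build this kind of read-ahead is a bounded queue filled by a background thread. The sketch below illustrates that pattern under stated assumptions and is not Chiquito's actual implementation: `load_layer_from_disk` is a hypothetical loader, and the sleeps stand in for disk reads and GPU compute.

```python
import queue
import threading
import time
from typing import Iterator, List


def load_layer_from_disk(index: int) -> str:
    """Hypothetical loader; the sleep stands in for disk read latency."""
    time.sleep(0.01)
    return f"weights-{index}"


def prefetch_layers(num_layers: int, window: int) -> Iterator[str]:
    """Yield layers in order while a background thread reads ahead.

    The bounded queue caps how many loaded layers can wait in the queue.
    """
    buffer: "queue.Queue[str]" = queue.Queue(maxsize=window)

    def producer() -> None:
        for i in range(num_layers):
            buffer.put(load_layer_from_disk(i))  # blocks while the queue is full

    threading.Thread(target=producer, daemon=True).start()
    for _ in range(num_layers):
        yield buffer.get()  # blocks only if the loader has fallen behind


consumed: List[str] = []
for weights in prefetch_layers(num_layers=6, window=2):
    time.sleep(0.01)  # stands in for the GPU forward pass of this layer
    consumed.append(weights)
print(consumed)
```

The consumer waits in `buffer.get()` only when loading is slower than compute, which matches the condition stated above: as long as the disk keeps up, inference does not pause.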

Section 07

Mode 3: Disk Fallback (preload_to_ram=False)

Minimum memory usage mode; layer weights are read from disk each time. This is the slowest mode but can run in extremely low-memory environments.
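Choosing among the three modes comes down to comparing model size with free RAM. The helper below is a hypothetical heuristic, not something Chiquito provides: only the `preload_to_ram` values (True, N, False) come from the description above, while the function name, the headroom default, and the minimum window of 2 are assumptions.

```python
from typing import Union


def choose_preload_to_ram(model_gb: float, free_ram_gb: float, num_layers: int,
                          headroom_gb: float = 4.0) -> Union[bool, int]:
    """Pick a preload_to_ram value from rough size estimates (illustrative only)."""
    usable = free_ram_gb - headroom_gb  # leave some RAM for the OS and activations
    if usable >= model_gb:
        return True  # Mode 1: the whole model fits in RAM
    per_layer_gb = model_gb / num_layers  # assumes roughly equal layer sizes
    window = int(usable // per_layer_gb) if usable > 0 else 0
    if window >= 2:  # assumed minimum; also avoids confusing 1 with True
        return window  # Mode 2: sliding window of N layers
    return False  # Mode 3: read each layer from disk


# Hypothetical inputs: sizes in GB, layer counts chosen for illustration.
print(choose_preload_to_ram(model_gb=14, free_ram_gb=64, num_layers=32))
print(choose_preload_to_ram(model_gb=65, free_ram_gb=64, num_layers=64))
print(choose_preload_to_ram(model_gb=65, free_ram_gb=5, num_layers=64))
```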

Section 08

Performance Test: Let the Data Speak

The project author ran the tests on a machine with an Intel Core i9-10980HK, 64GB RAM, and an RTX 2080 Super (8GB VRAM):

Small model (TinyLlama-1.1B):

Full preloading: load time 7.91s, time to generate 20 tokens 55.10s
Disk mode: load time 1.74s, time to generate 20 tokens 54.58s
Generation time is nearly identical because the model is small

Medium model (Qwen2.5-Coder-32B):

Full preloading: time to generate 20 tokens 361.67s
Disk mode: time to generate 20 tokens 391.50s
Preloading mode is about 8% faster, thanks to DMA transfer optimization
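The roughly 8% figure can be checked from the two timings above. A small sketch (the function name is illustrative), measuring speedup as the ratio of the disk-mode time to the preloading time:

```python
def speedup_percent(baseline_s: float, improved_s: float) -> float:
    """How much faster the improved run is, relative to its own duration."""
    return (baseline_s / improved_s - 1.0) * 100.0


disk_s, preload_s = 391.50, 361.67  # seconds to generate 20 tokens
print(f"{speedup_percent(disk_s, preload_s):.1f}% faster, "
      f"{disk_s - preload_s:.2f}s saved")
```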

Large model (65GB fp16):

Exceeds 64GB RAM, cannot use full preloading
Sliding window mode (5/10/34 layers) has performance close to disk mode
Verifies that background preloading can effectively hide disk latency

In addition, Chiquito supports 4-bit/8-bit quantization from bitsandbytes, which can compress a 32B model from 65GB to about 16GB (4-bit), further lowering the memory threshold.
