Zing Forum

BigSmall: Lossless Neural Network Weight Compression, Enabling Large Models to Run Smoothly on Small Memory

BigSmall losslessly compresses large language models to 65-83% of their original size. Combined with a streaming loader, it keeps peak memory usage during loading below 2GB, allowing users to run complete models on consumer-grade hardware without quantization.

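Tags: Neural network compression, Large language models, Lossless compression, Model inference, Memory optimization, HuggingFace, Quantization alternative, Streaming loading, AI deployment, PyTorch
Published 2026-05-19 01:42 | Recent activity 2026-05-19 01:52 | Estimated read 7 min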
Section 01

Introduction

BigSmall losslessly compresses large language models to 65-83% of their original size. Combined with a streaming loader, it keeps peak memory usage during loading below 2GB, allowing users to run complete models on consumer-grade hardware without quantization.

Section 02

Problem Background: Hardware Dilemma in the Era of Large Models

Suppose you want to run a large language model such as Mistral 7B. You immediately hit a hard limit: the model needs about 14GB of VRAM, but your laptop has only 8GB. The usual workaround is quantization, which compresses the weights to 4-bit precision. The catch is that a quantized model is no longer the original model.

Every weight is permanently degraded, output quality drops, fine-tuning starts from a drifted baseline, and the original model's outputs can no longer be reproduced exactly. For research, production, or any scenario that needs reliable results, quantization has been a compromise you simply had to accept.

BigSmall changes this.

Section 03

Core Innovation: Truly Lossless Compression

BigSmall is not quantization. After decompression, every weight is bit-for-bit identical to the original model's, and each tensor is verified with an MD5 checksum. You always get the complete original model.
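The round trip described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not BigSmall's code: `zlib` stands in for the actual codec (which the post does not describe), and the function names `tensor_bytes`, `compress_tensor`, and `decompress_and_verify` are hypothetical.

```python
import hashlib
import struct
import zlib


def tensor_bytes(values):
    """Pack floats as little-endian FP32, standing in for a raw weight tensor."""
    return struct.pack(f"<{len(values)}f", *values)


def compress_tensor(raw):
    """Return (compressed payload, MD5 hex digest of the original bytes)."""
    # zlib is only a stand-in codec for this sketch.
    return zlib.compress(raw, 9), hashlib.md5(raw).hexdigest()


def decompress_and_verify(payload, expected_md5):
    """Decompress and raise if the restored bytes differ from the original."""
    raw = zlib.decompress(payload)
    if hashlib.md5(raw).hexdigest() != expected_md5:
        raise ValueError("checksum mismatch: tensor is not bit-identical")
    return raw


weights = tensor_bytes([0.0125, -0.3371, 0.0009, 1.5, -2.25] * 1000)
payload, digest = compress_tensor(weights)
restored = decompress_and_verify(payload, digest)
print(restored == weights)
```

The checksum is taken over the raw bytes before compression, so any difference in the restored tensor, even a single bit, makes verification fail.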

Section 04

Compression Effect Comparison

Model | Original size | Compressed size | Compressed size (% of original)
Mistral 7B Instruct v0.3 | 14.2 GB | 9.3 GB | 65.6%
Llama 3.1 8B | 15.0 GB | 9.9 GB | 65.7%
Qwen 2.5 14B | 28.6 GB | 18.8 GB | 65.8%
Stable Diffusion 1.5 UNet | 1.72 GB | 1.48 GB | 85.9%
GPT-2 117M (FP32) | 548 MB | 414 MB | 75.5%

For FP32 models, the compressed size is 75-83% of the original, which matters for research scenarios that need full-precision floating-point weights.
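The percentages in the table are compressed size divided by original size, so a lower number means better compression. The short sketch below recomputes them from the listed sizes. The helper names `size_ratio` and `space_saved` are illustrative, and because the sizes in the table are rounded, the recomputed values can differ from the listed ones by a few tenths of a percent.

```python
# (original, compressed) sizes copied from the table above, in the table's own units.
sizes = {
    "Mistral 7B Instruct v0.3": (14.2, 9.3),
    "Llama 3.1 8B": (15.0, 9.9),
    "Qwen 2.5 14B": (28.6, 18.8),
    "Stable Diffusion 1.5 UNet": (1.72, 1.48),
    "GPT-2 117M (FP32)": (548, 414),
}


def size_ratio(original, compressed):
    """Compressed size as a percentage of the original size."""
    return 100.0 * compressed / original


def space_saved(original, compressed):
    """Percentage of the original size that compression removes."""
    return 100.0 - size_ratio(original, compressed)


for name, (orig, comp) in sizes.items():
    print(f"{name}: {size_ratio(orig, comp):.1f}% of original, "
          f"{space_saved(orig, comp):.1f}% saved")
```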

Section 05

Streaming Loader: Breaking the Memory Bottleneck

BigSmall's most distinctive feature is its streaming loader. Conventional loading reads the entire model into memory at once. The streaming loader instead decompresses one layer at a time, sends it directly to VRAM, and immediately releases the memory used by the previous layer.

This means:

  • Peak loading memory stays below 2GB, regardless of model size
  • No need to reserve space for the complete decompressed model: decompression happens layer by layer as the model loads
  • Supports models of any size: even 70B models can be loaded on consumer-grade hardware

Comparative tests on GPT-2 show that streaming loading has a 29.6% lower peak memory than full loading. For 70B-class models, the gap would reach tens of GB.
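The layer-at-a-time idea can be sketched with a generator. This is a toy model under stated assumptions, not BigSmall's loader: `zlib` stands in for the real codec, `send_to_device` is a hypothetical callback in place of a real host-to-VRAM copy, and the "peak" counts only decompressed bytes held at once rather than real process memory.

```python
import zlib


def make_checkpoint(num_layers, layer_bytes):
    """Build a toy checkpoint: one independently compressed blob per layer."""
    return [zlib.compress(bytes([i % 251]) * layer_bytes) for i in range(num_layers)]


def stream_layers(checkpoint):
    """Yield one decompressed layer at a time; earlier layers are not retained."""
    for blob in checkpoint:
        yield zlib.decompress(blob)


def load_streaming(checkpoint, send_to_device):
    """Decompress, hand off, and drop each layer before touching the next."""
    peak_host_bytes = 0
    for layer in stream_layers(checkpoint):
        peak_host_bytes = max(peak_host_bytes, len(layer))  # one layer resident
        send_to_device(layer)
        del layer  # release the host copy before the next layer is decompressed
    return peak_host_bytes


def load_full(checkpoint, send_to_device):
    """Decompress every layer first, then hand them all off."""
    layers = [zlib.decompress(blob) for blob in checkpoint]
    peak_host_bytes = sum(len(layer) for layer in layers)  # all layers resident
    for layer in layers:
        send_to_device(layer)
    return peak_host_bytes


device_bytes = []
checkpoint = make_checkpoint(num_layers=8, layer_bytes=1_000_000)
streaming_peak = load_streaming(checkpoint, lambda layer: device_bytes.append(len(layer)))
full_peak = load_full(checkpoint, lambda layer: None)
print(streaming_peak, full_peak)
```

In this toy, the streaming peak equals the size of the largest single layer, while the full-load peak grows with the number of layers. A real loader would also read the compressed blobs from disk incrementally instead of holding them all in a list.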

Section 06

Essential Differences from Quantization Solutions

Many people may ask: Why not just use 4-bit quantization? The answer lies in the chain of advantages brought by the word "lossless":

Feature | 4-bit Quantization | BigSmall
Lossless? | No (weights permanently degraded) | Yes (bit-for-bit identical)
Mistral 7B size | ~4 GB | 9.3 GB
Peak loading memory | ~4 GB | < 2 GB
Inference speed | Slower on some hardware | Native speed
Fine-tuning safety | No (baseline drift) | Yes (clean weights)
Output reproducibility | No | Yes
FP32 support | No | Yes

Quantization sacrifices model quality, while BigSmall sacrifices storage space. With storage now cheap, that is often the better trade-off.
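To see why quantization cannot be undone, consider a deliberately simple round-to-nearest scheme. This sketch is an assumption made for illustration: real 4-bit methods use per-group scales and smarter rounding and are considerably more accurate, but they share the property shown here, namely that dequantizing does not return the original values.

```python
def quantize_4bit(values):
    """Symmetric per-tensor quantization to integer codes in [-7, 7]."""
    scale = max(abs(v) for v in values) / 7 or 1.0  # fall back to 1.0 for all-zero input
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to floats; the rounding error is not recoverable."""
    return [c * scale for c in codes]


weights = [0.0125, -0.3371, 0.0009, 0.1502, -0.0718]
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(codes)
print(restored != weights, max_error)
```

Small weights such as 0.0125 and 0.0009 collapse to the same code, so no decoder can tell them apart afterwards. A lossless codec, by contrast, must return exactly the bytes it was given.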

Section 07

BigSmall vs DFloat11

DFloat11 is another well-known neural network compression project, but the two have different design philosophies:

Feature | BigSmall | DFloat11
Compressed size (BF16, % of original) | 65-66% | ~70%
Compressed size (FP32, % of original) | 75-83% | BF16 only
Inference overhead | None (decompress during loading) | ~2x slower (batch=1)
Hardware support | CPU, Apple Silicon, AMD, any GPU | CUDA only
Fine-tuning safety | Yes (fine-tune after decompression) | No (weights stay compressed)
vLLM compatible | Yes | Custom engine only
Peak memory (streaming) | < 2 GB | Requires full model VRAM

With DFloat11, the weights stay compressed during inference, so every forward pass has to decompress them, which adds ongoing overhead. BigSmall decompresses once at load time and then runs at native speed.
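The difference in call pattern can be made concrete with a small counter. This sketch models only the two strategies as the text describes them. It is not code from either project, `zlib` is a stand-in codec, and the class names are hypothetical.

```python
import zlib


class DecompressOnce:
    """Decompress every layer at load time, then reuse the plain weights."""

    def __init__(self, blobs):
        self.layers = [zlib.decompress(b) for b in blobs]
        self.decompress_calls = len(blobs)

    def forward(self):
        # Stand-in for real compute: touch every layer's bytes.
        return sum(len(layer) for layer in self.layers)


class DecompressPerPass:
    """Keep layers compressed and decompress them on every forward pass."""

    def __init__(self, blobs):
        self.blobs = blobs
        self.decompress_calls = 0

    def forward(self):
        total = 0
        for blob in self.blobs:
            layer = zlib.decompress(blob)
            self.decompress_calls += 1
            total += len(layer)
        return total


blobs = [zlib.compress(bytes([i]) * 10_000) for i in range(4)]
once, per_pass = DecompressOnce(blobs), DecompressPerPass(blobs)
for _ in range(10):
    assert once.forward() == per_pass.forward()

print(once.decompress_calls, per_pass.decompress_calls)
```

The trade-off is memory: the per-pass strategy keeps the weights compressed while resident, whereas decompress-once needs room for the full uncompressed weights on the device.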

Section 08

BigSmall vs ZipNN

ZipNN is another lossless compression solution; both are based on the same mathematical principles, but BigSmall leads in ease of use and ecosystem:

Feature | BigSmall | ZipNN
Compressed size (BF16, % of original) | 65-66% | ~67%
Compressed size (FP32, % of original) | 75-83% | ~83%
FP32/FP16/FP8/FP4 support | All | Mainly BF16
Streaming loader | Yes (peak < 2 GB) | No
HuggingFace pre-compressed models | 21+ | 5