Ternary-Zero: 2-bit Quantization Makes Large Models Fly on Consumer GPUs (Introduction)
Ternary-Zero is an open-source LLM inference acceleration framework. Its core innovation is 2-bit ternary quantization, which constrains weights to the values {-1, 0, +1} and compresses them 8x relative to 16-bit floating point, easing the memory bottleneck of large-model inference. A 70-billion-parameter model that would otherwise need over 140 GB of VRAM in FP16 can run on a single consumer-grade RTX 4090 (24 GB VRAM): 70B parameters at 2 bits each is roughly 17.5 GB. The framework is compatible with PyTorch, integrates with Hugging Face models, and also provides quantization-aware training.
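To make the idea concrete, here is a minimal sketch of ternary quantization in NumPy. This is an illustration of the general technique (absolute-mean scaling, as popularized by ternary-weight schemes), not Ternary-Zero's actual implementation; the function names and the scaling rule are assumptions for demonstration only.

```python
import numpy as np

def ternary_quantize(w: np.ndarray, eps: float = 1e-8):
    """Map float weights to {-1, 0, +1} plus one per-tensor scale.

    The scale is the mean absolute weight; dividing by it and rounding
    sends small weights to 0 and larger ones to +/-1.
    """
    scale = float(np.mean(np.abs(w))) + eps
    q = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return q, scale

def ternary_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from the ternary codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
q, s = ternary_quantize(w)

# Each ternary value needs only 2 bits, versus 16 bits for FP16:
# 70e9 params * 2 bits / 8 bits-per-byte ~= 17.5 GB, which is how a
# 70B model can fit in a 24 GB GPU once weights are packed.
assert set(np.unique(q)).issubset({-1, 0, 1})
```

In a real kernel the int8 codes would be bit-packed four-to-a-byte and unpacked on the fly inside the matmul; the sketch above keeps them unpacked for clarity.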