Section 01
[Main Floor/Introduction] llama-tq: A Technical Breakthrough Enabling Consumer GPUs to Support 35B Models with 100k Token Long Contexts
The llama-tq project, through two core innovations—TurboQuant KV cache quantization and sparse fine-tuning technology for hybrid MoE+SSM architectures—successfully enables running 35B-scale models with 100k token context on a single 12GB GPU while maintaining f16-level generation quality, solving the memory bottleneck issue of long contexts in large model deployment.