Lance-Quant: 4-bit Quantization Toolkit for ByteDance's Lance Multimodal Model

A customized 4-bit quantization solution for ByteDance's Lance multimodal large model, supporting both AWQ INT4 and NVFP4 formats. It achieves high-quality compression via task-aware calibration, reducing a 24.7GB model to 4.3GB.

Tags: quantization, AWQ, INT4, NVFP4, multimodal, Lance, ByteDance, LLM, model compression, MoE

Published 2026-05-21 07:13 · Recent activity 2026-05-21 07:21 · Estimated read 8 min

Section 02

Project Background: Why Does Lance Need Special Quantization?

Lance has an unusual architecture. It is based on a modified Qwen2.5-VL and adds parallel _moe_gen expert modules to each Transformer layer, implementing a "Mixture-of-Tasks" routing mechanism: understanding tokens flow through one expert, while generation tokens flow through another.
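The routing idea can be illustrated with a small sketch. It assumes each token carries a flag that says whether it belongs to the generation path. The class name `TaskRoutedLayer` and the toy experts are hypothetical and are not taken from the Lance codebase.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class TaskRoutedLayer:
    """Toy stand-in for a projection with a parallel `_moe_gen` sibling (hypothetical)."""
    und_expert: Callable[[Vector], Vector]  # used by understanding tokens
    gen_expert: Callable[[Vector], Vector]  # used by generation tokens

    def forward(self, tokens: List[Vector], is_generation: List[bool]) -> List[Vector]:
        # In this sketch the task flag alone picks the expert, so every token
        # passes through exactly one of the two.
        return [
            self.gen_expert(tok) if gen else self.und_expert(tok)
            for tok, gen in zip(tokens, is_generation)
        ]


layer = TaskRoutedLayer(
    und_expert=lambda v: [2.0 * x for x in v],
    gen_expert=lambda v: [-x for x in v],
)
routed = layer.forward([[1.0], [2.0], [3.0]], [False, True, False])
```

If a batch contains only understanding tokens, `gen_expert` is never called, so nothing is ever observed about the inputs it would receive.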

This architecture poses quantization challenges:

Architectural Specificity: Standard quantization tools like AWQ and AutoAWQ cannot recognize Lance's custom PreTrainedModel architecture.
Routing Complexity: Simple x2t (image-to-text) calibration misses _moe_gen weights, leading to severe quality degradation in the generation path after quantization.
Runtime Compatibility: Inference engines like vLLM and TensorRT-LLM do not yet support the Lance architecture.

lance-quant addresses all three issues with hand-written calibration, weight packing, and runtime layer replacement.

Section 03

Calibration Phase: Task-Aware Data Collection

Unlike standard AWQ, lance-quant uses a dual-task calibration strategy:

Script	Function
`awq_calibrate_single.py`	Runs Lance inference on a single task, registers activation hooks on the 504 target Linear layers (`q/k/v/o_proj`, `mlp.{gate,up,down}_proj`, and the `_moe_gen` sibling of each), and saves the per-channel mean absolute activation magnitude.
`awq_merge_stats.py`	Merges statistics from multiple tasks into a single calibration set.

Key Insight: With pure x2t calibration, no activations reach the _moe_gen weights, so AWQ has no statistics for them and falls back to simple min-max quantization. This is the root cause of the "gibberish" outputs. Adding t2i (text-to-image) samples sends activations through the generation path, so AWQ can compute appropriate scaling factors for those layers too.
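A minimal sketch of the collect-then-merge idea follows, in plain Python rather than with real forward hooks. The layer names, the toy activations, and the token-weighted merge rule are assumptions for illustration; the source does not say how `awq_merge_stats.py` weights the tasks.

```python
from typing import Dict, List, Tuple

# layer name -> (per-channel sum of |activation|, number of tokens seen)
LayerStats = Dict[str, Tuple[List[float], int]]


def collect(stats: LayerStats, layer: str, tokens: List[List[float]]) -> None:
    """Accumulate what a forward hook on `layer` would observe for one batch."""
    sums, count = stats.get(layer, ([0.0] * len(tokens[0]), 0))
    for tok in tokens:
        sums = [s + abs(x) for s, x in zip(sums, tok)]
    stats[layer] = (sums, count + len(tokens))


def merge(per_task: List[LayerStats]) -> Dict[str, List[float]]:
    """Token-weighted merge of several tasks into per-channel mean |activation|."""
    total: LayerStats = {}
    for stats in per_task:
        for layer, (sums, count) in stats.items():
            prev, n = total.get(layer, ([0.0] * len(sums), 0))
            total[layer] = ([a + b for a, b in zip(prev, sums)], n + count)
    return {layer: [s / n for s in sums] for layer, (sums, n) in total.items()}


DOWN = "layers.0.mlp.down_proj"              # hypothetical layer names
DOWN_GEN = "layers.0.mlp.down_proj_moe_gen"

x2t: LayerStats = {}
collect(x2t, DOWN, [[1.0, -2.0], [3.0, 0.0]])      # understanding path only

t2i: LayerStats = {}
collect(t2i, DOWN, [[-1.0, 1.0]])                  # toy understanding-path tokens
collect(t2i, DOWN_GEN, [[0.5, -0.5], [1.5, 0.5]])  # toy generation-path tokens

x2t_only = merge([x2t])
merged = merge([x2t, t2i])
```

In this toy run `x2t_only` has no entry for the `_moe_gen` layer at all, while `merged` covers both layers.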

Section 04

Quantization Application: Grid Search & Grouping Strategy

Script	Output Format	Description
`awq_apply.py`	INT4	Grid-searches the AWQ scaling factors for each norm and the linear layers that consume its output, fuses the chosen scales into the preceding RMSNorm, and packs the weights into INT4 group by group.
`nvfp4_apply.py`	NVFP4	Uses the same calibration data but packs into NVFP4 format (E2M1 values with one FP8 E4M3 scale per 16-element block), which targets Blackwell tensor cores.
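The scale search can be sketched in a few lines. This is a simplified illustration, not the code in `awq_apply.py`: it assumes the common AWQ parameterisation where the per-channel scale is the activation magnitude raised to a power `alpha`, and it scores each candidate with an activation-weighted weight error instead of a real forward pass. All function names are hypothetical.

```python
from typing import List, Tuple


def quantize_group(values: List[float], bits: int = 4) -> List[float]:
    """Asymmetric min-max round-to-nearest; returns the dequantized values."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (2 ** bits - 1) or 1.0  # avoid a zero step for constant groups
    return [lo + round((v - lo) / step) * step for v in values]


def quantize_row(row: List[float], group_size: int) -> List[float]:
    out: List[float] = []
    for start in range(0, len(row), group_size):
        out.extend(quantize_group(row[start:start + group_size]))
    return out


def awq_grid_search(weight: List[List[float]], act_mag: List[float],
                    group_size: int = 4, steps: int = 20) -> Tuple[float, float, float]:
    """Return (best_alpha, best_error, plain_round_to_nearest_error)."""
    errors = []
    for k in range(steps + 1):
        alpha = k / steps  # alpha = 0 means no scaling at all
        scales = [max(a, 1e-8) ** alpha for a in act_mag]
        err = 0.0
        for row in weight:
            scaled = [w * s for w, s in zip(row, scales)]
            deq = quantize_row(scaled, group_size)
            # Dividing by s models the inverse scale that gets fused into the norm;
            # the error is weighted by how strongly each input channel fires.
            err += sum((a * (w - q / s)) ** 2
                       for a, w, q, s in zip(act_mag, row, deq, scales))
        errors.append((err, alpha))
    best_err, best_alpha = min(errors)
    return best_alpha, best_err, errors[0][0]


weight = [[0.9, -0.2, 0.05, 0.4, -0.7, 0.3, 0.02, -0.1],
          [-0.3, 0.6, -0.04, 0.8, 0.2, -0.5, 0.03, 0.7]]
act_mag = [0.1, 0.1, 8.0, 0.1, 0.1, 0.1, 6.0, 0.1]  # two salient input channels
best_alpha, best_err, rtn_err = awq_grid_search(weight, act_mag)
```

Because `alpha = 0` is part of the grid, the selected error can never be worse than plain round-to-nearest under this metric.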

Section 05

Runtime Replacement & Memory Optimization

Script/Module	Function
`run_baseline.py`	bf16 baseline inference with a memory-optimized loader (meta initialization + streaming bf16 conversion), enabling a 12.3GB bf16 model to run on a 16GB GPU.
`run_quant_eval.py`	Replaces Linear layers with `WQLinearINT4`/`WQLinearNVFP4` and runs comparative evaluation.
`quantized_linear.py`	A pure PyTorch reference module supporting on-demand dequantization for correctness verification.
`comfyui/`	ComfyUI custom node package that automatically detects the Lance source.
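Packing and on-demand dequantization can be sketched as follows. The nibble order, the per-group `(step, minimum)` metadata, and all function names are assumptions made for illustration; the actual layout used by `WQLinearINT4` is not described in the source.

```python
from typing import List, Tuple


def quantize_groups(values: List[float], group_size: int) -> Tuple[List[int], List[Tuple[float, float]]]:
    """Return 4-bit codes plus one (step, minimum) pair per group."""
    codes: List[int] = []
    params: List[Tuple[float, float]] = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        lo, hi = min(group), max(group)
        step = (hi - lo) / 15 or 1.0
        codes.extend(min(15, round((v - lo) / step)) for v in group)
        params.append((step, lo))
    return codes, params


def pack_int4(codes: List[int]) -> bytes:
    """Two codes per byte, low nibble first (an assumed layout)."""
    if len(codes) % 2:
        raise ValueError("need an even number of 4-bit codes")
    return bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))


def unpack_int4(packed: bytes) -> List[int]:
    codes: List[int] = []
    for byte in packed:
        codes += [byte & 0x0F, byte >> 4]
    return codes


def dequantize(packed: bytes, params: List[Tuple[float, float]], group_size: int) -> List[float]:
    """Rebuild float weights on demand from packed codes and group parameters."""
    out = []
    for i, code in enumerate(unpack_int4(packed)):
        step, lo = params[i // group_size]
        out.append(lo + code * step)
    return out


weights = [0.5, -1.0, 0.25, 2.0, -0.3, 0.0, 0.7, 0.1]
codes, params = quantize_groups(weights, group_size=4)
packed = pack_int4(codes)
restored = dequantize(packed, params, group_size=4)
```

A reference module of this kind trades speed for clarity: every restored weight differs from the original by at most half a quantization step, which makes it convenient for checking a faster kernel.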

Section 06

Full Multimodal Version (Recommended for Production)

These variants retain Lance's MoE routing and support both image/video generation and understanding:

Variant	Original Size	Quantized Size	Compression Ratio
Lance-3B-AWQ-INT4	24.7 GB	4.31 GB	5.7x
Lance-3B-Video-AWQ-INT4	28.4 GB	6.15 GB	4.6x
Lance-3B-NVFP4 (Blackwell)	24.7 GB	5.09 GB	4.9x
Lance-3B-Video-NVFP4	28.4 GB	6.93 GB	4.1x
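The ratios in the table are simply original size divided by quantized size. The sketch below recomputes two of them and also estimates storage per weight for each format. The NVFP4 figure follows from the block layout stated earlier (4-bit codes plus one 8-bit scale per 16 weights). The INT4 figure assumes an fp16 scale and a 4-bit zero point per group of 64, which the source does not confirm. The checkpoints may also contain layers that stay unquantized, so these per-weight numbers are not expected to reproduce the file sizes.

```python
def effective_bits(code_bits: float, group_size: int, metadata_bits: float) -> float:
    """Average storage per weight: the code plus the group's amortized metadata."""
    return code_bits + metadata_bits / group_size


def compression_ratio(original_gb: float, quantized_gb: float) -> float:
    return original_gb / quantized_gb


# NVFP4: 4-bit E2M1 codes, one 8-bit FP8 E4M3 scale per 16 weights.
nvfp4_bits = effective_bits(4, 16, 8)
# INT4, group_size=64, ASSUMING an fp16 scale and a 4-bit zero point per group.
int4_bits = effective_bits(4, 64, 16 + 4)

ratios = {
    "Lance-3B-AWQ-INT4": round(compression_ratio(24.7, 4.31), 1),
    "Lance-3B-NVFP4": round(compression_ratio(24.7, 5.09), 1),
}
```

Under these assumptions NVFP4 carries slightly more metadata per weight than grouped INT4, which points in the same direction as the larger NVFP4 sizes in the table.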

Section 07

Apple Silicon Special Version (Understanding Path Only)

Extracts only the understanding path, which follows the standard Qwen2 architecture, for Apple Silicon/iOS deployment:

Variant	Size	Description
Lance-3B-und-MLX-4bit-DWQ	1.6 GB	Recommended (distilled scaling)
Lance-3B-und-MLX-4bit	1.6 GB	Pure post-training quantization
Lance-3B-und-MLX-NVFP4	1.6 GB	Future ANE acceleration
Lance-3B-und-CoreML-palettized	6.2 GB fp16	iOS/ANE pipeline

Section 08

v2 Improvements: group_size=64 Fixes Long Text Drift

v1 used group_size=128 and reached only 33% exact match on the 6-sample x2t image benchmark. One case shows the typical AWQ degradation on long text: asked about "1998 promotion campaign costs", the model inserted a fictional entity ("Scott Levin and his family") into its answer.

v2 re-quantization uses group_size=64:

Same calibration data, same recipe, only finer granularity
Quality jumps to 50% exact match
Case 4 matches the baseline exactly: "According to market research data, total spending on promotion meetings and activities in 1998 was approximately 1.3 billion US dollars"

Why this helps: o_proj and down_proj are not fed directly by a norm (their inputs come out of a nonlinearity), so their AWQ scaling cannot be fused into a preceding norm and they use plain per-group quantization. With smaller groups, fewer outliers compete for the same scaling factor, which lowers the per-channel quantization noise.
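The effect of group size can be reproduced on a toy weight row. This is a sketch with synthetic data and plain min-max round-to-nearest, not a measurement on Lance weights; `group_quant_error` is a hypothetical helper.

```python
from typing import List


def group_quant_error(row: List[float], group_size: int, bits: int = 4) -> float:
    """Sum of squared errors after min-max round-to-nearest per group."""
    levels = 2 ** bits - 1
    err = 0.0
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        lo, hi = min(group), max(group)
        step = (hi - lo) / levels or 1.0
        err += sum((v - (lo + round((v - lo) / step) * step)) ** 2 for v in group)
    return err


# 128 small weights in [-0.1, 0.1] with a single outlier in the first half.
row = [0.01 * ((i * 37) % 21 - 10) for i in range(128)]
row[5] = 2.0

err_g128 = group_quant_error(row, 128)  # the outlier stretches the step for all 128 weights
err_g64 = group_quant_error(row, 64)    # the second group of 64 is no longer affected
```

With one group of 128, the outlier sets the step size for every weight in the row. With two groups of 64, only the group that contains the outlier pays for it, so `err_g64` is smaller than `err_g128`.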
