# Lance-Quant: 4-bit Quantization Toolkit for ByteDance's Lance Multimodal Model

> A customized 4-bit quantization solution for ByteDance's Lance multimodal large model, supporting both AWQ INT4 and NVFP4 formats. It achieves high-quality compression via task-aware calibration, reducing a 24.7GB model to 4.3GB.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-20T23:13:54.000Z
- Last activity: 2026-05-20T23:21:27.324Z
- Popularity: 163.9
- Keywords: quantization, AWQ, INT4, NVFP4, multimodal, Lance, ByteDance, LLM, model compression, MoE
- Page link: https://www.zingnex.cn/en/forum/thread/lance-quant-lance4-bit
- Canonical: https://www.zingnex.cn/forum/thread/lance-quant-lance4-bit

---

## Project Background: Why Does Lance Need Special Quantization?

Lance uses an unusual architecture. Built on a modified Qwen2.5-VL, it adds a parallel `_moe_gen` expert module to each Transformer layer, implementing a "Mixture-of-Tasks" routing mechanism: understanding tokens flow through one expert, and generation tokens flow through the other.
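
The routing idea can be illustrated with a toy layer. This is a minimal sketch under stated assumptions, not Lance's actual code: `DualPathLinear`, `w_und`, `w_gen`, and the per-token `is_gen` flag are hypothetical names, and the real model may decide routing differently.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product, one output value per weight row."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


class DualPathLinear:
    """Toy stand-in for a Linear layer with a parallel `_moe_gen` sibling.

    Hypothetical illustration: each token is sent to exactly one of the
    two weight matrices depending on its task type.
    """

    def __init__(self, w_und: Matrix, w_gen: Matrix) -> None:
        self.w_und = w_und  # weights used by understanding tokens
        self.w_gen = w_gen  # weights used by generation tokens
        self.calls = {"und": 0, "gen": 0}  # how often each path was used

    def forward(self, tokens: List[Vector], is_gen: List[bool]) -> List[Vector]:
        outputs = []
        for x, gen in zip(tokens, is_gen):
            self.calls["gen" if gen else "und"] += 1
            outputs.append(matvec(self.w_gen if gen else self.w_und, x))
        return outputs


layer = DualPathLinear(
    w_und=[[1.0, 0.0], [0.0, 1.0]],
    w_gen=[[2.0, 0.0], [0.0, 2.0]],
)
outputs = layer.forward([[1.0, 2.0], [3.0, 4.0]], is_gen=[False, True])
```

The `calls` counter makes the calibration problem visible: a workload that contains only understanding tokens never touches `w_gen`.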

This architecture poses quantization challenges:

1. **Architectural Specificity**: Standard quantization tools such as the reference AWQ implementation and AutoAWQ do not recognize Lance's custom `PreTrainedModel` architecture.
2. **Routing Complexity**: Calibrating only on x2t (image-to-text) never exercises the `_moe_gen` weights, so the generation path degrades severely after quantization.
3. **Runtime Compatibility**: Inference engines like vLLM and TensorRT-LLM do not yet support the Lance architecture.

lance-quant addresses all three issues with hand-written calibration, weight packing, and runtime layer replacement.

## Calibration Phase: Task-Aware Data Collection

Unlike standard AWQ, lance-quant uses a **dual-task calibration strategy**:

| Script | Function |
|------|------|
| `awq_calibrate_single.py` | Runs Lance inference on a single task, registers activation hooks on the 504 target Linear layers (`q/k/v/o_proj`, `mlp.{gate,up,down}_proj`, and the `_moe_gen` sibling of each), and saves the per-channel mean absolute activation magnitude. |
| `awq_merge_stats.py` | Merges the statistics from multiple tasks into a single calibration set. |
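
The two steps in the table can be sketched as a running per-channel statistic plus a merge. This is an illustrative sketch only: `ChannelStats` and `merge_stats` are hypothetical names, the layer keys are made up, and merging by summing totals and token counts is an assumption about how `awq_merge_stats.py` might combine tasks.

```python
from typing import Dict, List


class ChannelStats:
    """Running per-channel mean of |activation| for one Linear layer's input."""

    def __init__(self, num_channels: int) -> None:
        self.abs_sum = [0.0] * num_channels
        self.count = 0  # number of token vectors seen so far

    def update(self, token: List[float]) -> None:
        """What an activation hook would call for every input token."""
        for channel, value in enumerate(token):
            self.abs_sum[channel] += abs(value)
        self.count += 1

    def mean(self) -> List[float]:
        if self.count == 0:
            raise ValueError("no activations recorded for this layer")
        return [total / self.count for total in self.abs_sum]


def merge_stats(a: ChannelStats, b: ChannelStats) -> ChannelStats:
    """Combine two tasks by adding sums and counts (a token-weighted mean)."""
    merged = ChannelStats(len(a.abs_sum))
    merged.abs_sum = [x + y for x, y in zip(a.abs_sum, b.abs_sum)]
    merged.count = a.count + b.count
    return merged


# Hypothetical layer names: one understanding layer and its `_moe_gen` sibling.
names = ["up_proj", "up_proj_moe_gen"]
x2t: Dict[str, ChannelStats] = {name: ChannelStats(2) for name in names}
t2i: Dict[str, ChannelStats] = {name: ChannelStats(2) for name in names}

# The x2t task only reaches the understanding layer.
x2t["up_proj"].update([1.0, -3.0])
x2t["up_proj"].update([3.0, 1.0])
# The t2i task reaches the generation sibling.
t2i["up_proj_moe_gen"].update([-2.0, 4.0])

merged = {name: merge_stats(x2t[name], t2i[name]) for name in names}
```

After x2t alone, `x2t["up_proj_moe_gen"].count` is still zero; only the merged set has statistics for both layers.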

Key Insight: With x2t-only calibration, the `_moe_gen` weights receive no activation data, so AWQ falls back to simple min-max quantization for them. This is the root cause of "gibberish" outputs. Adding t2i (text-to-image) to the calibration routes activations through the generation path, which lets AWQ compute appropriate scaling factors for these layers.
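
To show why the activation statistics matter, here is a simplified sketch of an AWQ-style scale search. It makes several assumptions: symmetric round-to-nearest INT4 instead of the zero-point scheme real AWQ implementations typically use, no scale normalization, and hypothetical function names. With `alpha = 0` every scale is 1, which corresponds to quantizing without any activation information.

```python
from typing import List, Tuple


def quantize_row_int4(row: List[float], group_size: int) -> List[float]:
    """Symmetric round-to-nearest INT4 per group; returns dequantized values."""
    out: List[float] = []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        out.extend(max(-8, min(7, round(v / scale))) * scale for v in group)
    return out


def output_error(weight: List[List[float]], samples: List[List[float]],
                 scales: List[float], group_size: int) -> float:
    """Squared output error of Q(W * s) applied to x / s, versus W applied to x."""
    err = 0.0
    for row in weight:
        scaled = [w * s for w, s in zip(row, scales)]
        deq = quantize_row_int4(scaled, group_size)
        for x in samples:
            ref = sum(w * xi for w, xi in zip(row, x))
            got = sum(q * xi / s for q, xi, s in zip(deq, x, scales))
            err += (ref - got) ** 2
    return err


def search_awq_scales(weight: List[List[float]], samples: List[List[float]],
                      act_mean: List[float], group_size: int,
                      grid: int = 20) -> Tuple[float, float, List[float]]:
    """Try s = act_mean ** alpha for alpha in [0, 1]; keep the lowest error."""
    best = (float("inf"), 0.0, [1.0] * len(act_mean))
    for step in range(grid + 1):
        alpha = step / grid
        scales = [max(m, 1e-4) ** alpha for m in act_mean]
        err = output_error(weight, samples, scales, group_size)
        if err < best[0]:
            best = (err, alpha, scales)
    return best


# Toy data: input channel 1 carries large activations but small weights.
weight = [[0.9, -0.05, 0.04, 0.03], [-0.8, 0.06, -0.02, 0.05]]
samples = [[0.1, 4.0, 0.2, 0.1], [0.2, -5.0, 0.1, 0.3]]
act_mean = [sum(abs(x[c]) for x in samples) / len(samples) for c in range(4)]

baseline_err = output_error(weight, samples, [1.0] * 4, group_size=4)
best_err, best_alpha, best_scales = search_awq_scales(
    weight, samples, act_mean, group_size=4)
```

Because `alpha = 0` is part of the grid, the searched result can never be worse than the unscaled baseline on the calibration samples.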

## Quantization Application: Grid Search & Grouping Strategy

| Script | Output Format | Description |
|------|---------|------|
| `awq_apply.py` | INT4 | Grid-searches the AWQ scaling balance for each norm + consumer Linear pair, fuses the scaling factors into the preceding RMSNorm, and packs the weights into INT4 by group. |
| `nvfp4_apply.py` | NVFP4 | Uses the same calibration data but packs into NVFP4 (E2M1 values with one FP8 E4M3 scale per 16-element block), targeting Blackwell tensor cores. |
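
The NVFP4 block layout can be sketched as follows. This is a simplified illustration with hypothetical function names, not `nvfp4_apply.py` itself: the block scale is kept as an ordinary Python float rather than being rounded to FP8 E4M3, and any additional tensor-level scaling is omitted.

```python
from typing import List, Tuple

# The eight non-negative magnitudes representable in E2M1 (2 exponent bits,
# 1 mantissa bit), listed in order of their 3-bit encoding.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16


def quantize_block(block: List[float]) -> Tuple[float, List[int]]:
    """Map one 16-element block to a scale plus 4-bit codes (sign + 3 bits)."""
    amax = max(abs(v) for v in block)
    # Scale so the largest magnitude lands on 6.0, the largest E2M1 value.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for v in block:
        magnitude = abs(v) / scale
        index = min(range(8), key=lambda i: abs(E2M1_GRID[i] - magnitude))
        sign = 1 if v < 0 else 0
        codes.append((sign << 3) | index)
    return scale, codes


def dequantize_block(scale: float, codes: List[int]) -> List[float]:
    """Inverse mapping: look up the magnitude, apply sign and block scale."""
    return [(-1.0 if code & 8 else 1.0) * E2M1_GRID[code & 7] * scale
            for code in codes]


block = [i * 0.5 - 3.0 for i in range(BLOCK_SIZE)]  # values from -3.0 to 4.5
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
```

The grid is non-uniform: small magnitudes are spaced 0.5 apart while the top of the range jumps from 4 to 6, so the block's largest element is reproduced closely and mid-range values absorb most of the rounding error.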

## Runtime Replacement & Memory Optimization

| Script/Module | Function |
|-----------|------|
| `run_baseline.py` | bf16 baseline inference with a memory-optimized loader (meta initialization + streaming bf16 conversion), which lets the 12.3 GB bf16 model run on a 16 GB GPU. |
| `run_quant_eval.py` | Replaces Linear layers with `WQLinearINT4`/`WQLinearNVFP4` and runs comparative evaluation. |
| `quantized_linear.py` | Pure PyTorch reference module that dequantizes on demand, used for correctness verification. |
| `comfyui/` | ComfyUI custom node package that automatically detects the Lance source. |
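
On-demand dequantization can be illustrated without PyTorch. The sketch below is an assumption-laden stand-in for what a reference module like `quantized_linear.py` might do: `ReferenceInt4Linear`, the low-nibble-first packing order, and the zero-point scheme are all hypothetical choices, not the project's documented layout.

```python
from typing import List, Tuple


def pack_int4(values: List[int]) -> bytes:
    """Pack pairs of 4-bit codes into bytes, first value in the low nibble."""
    if len(values) % 2:
        raise ValueError("need an even number of 4-bit values")
    return bytes((hi << 4) | lo for lo, hi in zip(values[0::2], values[1::2]))


def unpack_int4(packed: bytes) -> List[int]:
    """Inverse of pack_int4."""
    out: List[int] = []
    for byte in packed:
        out.append(byte & 0x0F)
        out.append(byte >> 4)
    return out


def quantize_group(group: List[float]) -> Tuple[List[int], float, int]:
    """Asymmetric 4-bit quantization of one group (range widened to include 0)."""
    lo, hi = min(min(group), 0.0), max(max(group), 0.0)
    scale = (hi - lo) / 15.0 if hi > lo else 1.0
    zero = max(0, min(15, round(-lo / scale)))
    codes = [max(0, min(15, round(v / scale) + zero)) for v in group]
    return codes, scale, zero


class ReferenceInt4Linear:
    """Stores packed INT4 weights and dequantizes a row only when it is used."""

    def __init__(self, weight: List[List[float]], group_size: int) -> None:
        self.group_size = group_size
        self.packed: List[bytes] = []
        self.scales: List[List[float]] = []
        self.zeros: List[List[int]] = []
        for row in weight:
            codes: List[int] = []
            row_scales: List[float] = []
            row_zeros: List[int] = []
            for start in range(0, len(row), group_size):
                c, s, z = quantize_group(row[start:start + group_size])
                codes.extend(c)
                row_scales.append(s)
                row_zeros.append(z)
            self.packed.append(pack_int4(codes))
            self.scales.append(row_scales)
            self.zeros.append(row_zeros)

    def dequantize_row(self, r: int) -> List[float]:
        g = self.group_size
        return [(code - self.zeros[r][i // g]) * self.scales[r][i // g]
                for i, code in enumerate(unpack_int4(self.packed[r]))]

    def forward(self, x: List[float]) -> List[float]:
        return [sum(w * xi for w, xi in zip(self.dequantize_row(r), x))
                for r in range(len(self.packed))]


layer = ReferenceInt4Linear([[0.0, 1.0, 2.0, 3.0]], group_size=4)
result = layer.forward([1.0, 1.0, 1.0, 1.0])
```

A reference path like this is slow but easy to audit, which is why it suits correctness checks rather than production inference.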

## Full Multimodal Version (Recommended for Production)

These variants retain Lance's MoE routing and support both image/video generation and understanding:

| Variant | Original Size | Quantized Size | Compression Ratio |
|------|---------|--------|-------|
| Lance-3B-AWQ-INT4 | 24.7 GB | **4.31 GB** | 5.7x |
| Lance-3B-Video-AWQ-INT4 | 28.4 GB | **6.15 GB** | 4.6x |
| Lance-3B-NVFP4 (Blackwell) | 24.7 GB | **5.09 GB** | 4.9x |
| Lance-3B-Video-NVFP4 | 28.4 GB | **6.93 GB** | 4.1x |
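
The compression ratios follow directly from the two size columns. The short check below recomputes them from the figures in the table; the dictionary layout is only for illustration.

```python
# (original size in GB, quantized size in GB), copied from the table above.
variants = {
    "Lance-3B-AWQ-INT4": (24.7, 4.31),
    "Lance-3B-Video-AWQ-INT4": (28.4, 6.15),
    "Lance-3B-NVFP4 (Blackwell)": (24.7, 5.09),
    "Lance-3B-Video-NVFP4": (28.4, 6.93),
}

# Ratio = original / quantized, rounded to one decimal place as in the table.
ratios = {name: round(original / quantized, 1)
          for name, (original, quantized) in variants.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```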

## Apple Silicon Special Version (Understanding Path Only)

These variants extract only the understanding path, which follows the standard Qwen2 architecture, for deployment on Apple Silicon and iOS:

| Variant | Size | Description |
|------|------|------|
| Lance-3B-und-MLX-4bit-DWQ | 1.6 GB | Recommended (distilled scaling) |
| Lance-3B-und-MLX-4bit | 1.6 GB | Pure post-training quantization |
| Lance-3B-und-MLX-NVFP4 | 1.6 GB | Future ANE acceleration |
| Lance-3B-und-CoreML-palettized | 6.2 GB (fp16) | iOS/ANE pipeline |

## v2 Improvements: group_size=64 Fixes Long Text Drift

v1 used `group_size=128` and reached only **33% exact match** on the 6-sample x2t image benchmark. A typical failure shows classic AWQ long-text degradation: asked about "1998 promotion campaign costs", the model inserted a fictional entity ("Scott Levin and his family") into its answer.

v2 re-quantization uses `group_size=64`:

- **Same calibration data, same recipe, only finer granularity**
- Quality jumps to **50% exact match**
- Case 4 matches the baseline exactly: "According to market research data, total spending on promotion meetings and activities in 1998 was approximately 1.3 billion US dollars"

Why the fix works: `o_proj` and `down_proj` cannot have their AWQ scaling fused into a preceding norm because their inputs come after a nonlinearity, so they rely on pure per-group quantization. With smaller groups, fewer outliers compete for the same scale, which lowers per-channel quantization noise.
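
The effect of group size on a row with an outlier can be reproduced with a small experiment. This is a toy sketch assuming symmetric round-to-nearest INT4 with one scale per group; the synthetic weights are made up and the function name is hypothetical.

```python
import math
from typing import List


def quant_error(row: List[float], group_size: int) -> float:
    """Total squared error of symmetric INT4 quantization, one scale per group."""
    err = 0.0
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        for v in group:
            q = max(-8, min(7, round(v / scale)))
            err += (v - q * scale) ** 2
    return err


# 128 small weights plus a single large outlier in the first half of the row.
row = [0.1 * math.sin(i) for i in range(128)]
row[5] = 8.0

err_128 = quant_error(row, group_size=128)  # the outlier sets the scale for all
err_64 = quant_error(row, group_size=64)    # the second half gets its own scale
```

With one group of 128, the outlier stretches the scale so far that every small weight rounds to zero. Splitting into two groups of 64 confines that damage to the half containing the outlier, so `err_64` comes out lower than `err_128`.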
