Zing Forum

Lance-Quant: 4-bit Quantization Toolkit for ByteDance's Lance Multimodal Model

A customized 4-bit quantization solution for ByteDance's Lance multimodal large model, supporting both AWQ INT4 and NVFP4 formats. It achieves high-quality compression via task-aware calibration, reducing a 24.7GB model to 4.3GB.

Tags: quantization, AWQ, INT4, NVFP4, multimodal, Lance, ByteDance, LLM, model compression, MoE
Published 2026-05-21 07:13 | Recent activity 2026-05-21 07:21 | Estimated read 8 min

Section 01

Introduction

A customized 4-bit quantization solution for ByteDance's Lance multimodal large model, supporting both AWQ INT4 and NVFP4 formats. It achieves high-quality compression via task-aware calibration, reducing a 24.7GB model to 4.3GB.

Section 02

Project Background: Why Does Lance Need Special Quantization?

Lance has an unusual architecture. Built on a modified Qwen2.5-VL, it adds a parallel _moe_gen expert module to every Transformer layer and routes tokens with a "Mixture-of-Tasks" mechanism: understanding tokens flow through one expert, while generation tokens flow through the other.
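To make the routing idea concrete, here is a minimal pure-Python sketch of one layer with two parallel experts selected by a per-token task flag. The class name `DualPathLayer`, the `is_gen` flag, and the toy expert functions are all hypothetical; Lance's real modules are PyTorch layers and may dispatch tokens differently.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class DualPathLayer:
    """Toy stand-in for one layer holding two parallel experts (hypothetical)."""
    und_expert: Callable[[Vector], Vector]  # path taken by understanding tokens
    gen_expert: Callable[[Vector], Vector]  # the `_moe_gen` sibling

    def forward(self, tokens: List[Vector], is_gen: List[bool]) -> List[Vector]:
        # In this sketch the route depends only on the task flag of each token.
        return [
            self.gen_expert(tok) if gen else self.und_expert(tok)
            for tok, gen in zip(tokens, is_gen)
        ]


layer = DualPathLayer(
    und_expert=lambda v: [2.0 * x for x in v],
    gen_expert=lambda v: [x + 1.0 for x in v],
)
out = layer.forward([[1.0, 2.0], [1.0, 2.0]], is_gen=[False, True])
print(out)  # [[2.0, 4.0], [2.0, 3.0]]
```

The same input vector produces different outputs depending on the flag, which is why a calibration set that only ever sets one flag leaves the other expert unobserved.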

This architecture poses quantization challenges:

  1. Architectural Specificity: Standard AWQ tooling such as AutoAWQ does not recognize Lance's custom PreTrainedModel architecture.
  2. Routing Complexity: Calibrating on x2t (image-to-text) alone never exercises the _moe_gen weights, which severely degrades the generation path after quantization.
  3. Runtime Compatibility: Inference engines like vLLM and TensorRT-LLM do not yet support the Lance architecture.

lance-quant addresses all three with hand-written calibration, weight packing, and runtime layer replacement.

Section 03

Calibration Phase: Task-Aware Data Collection

Unlike standard AWQ, lance-quant uses a dual-task calibration strategy:

| Script | Function |
| --- | --- |
| awq_calibrate_single.py | Runs Lance inference on a single task, registers activation hooks on the 504 target Linear layers (q/k/v/o_proj, mlp.{gate,up,down}_proj, and the _moe_gen sibling of each), and saves the per-channel mean absolute activation magnitude. |
| awq_merge_stats.py | Merges the statistics from multiple tasks into a single calibration set. |
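The two steps in the table can be sketched in pure Python as a running per-channel mean of |x| plus a merge across tasks. This is an illustration only: the class `ChannelAbsMean`, the function `merge`, and the layer names are hypothetical, and the token-weighted averaging in `merge` is an assumption, since the post does not say how awq_merge_stats.py combines tasks.

```python
from typing import Dict, List


class ChannelAbsMean:
    """Running per-channel mean of |x| over all tokens seen by one layer."""

    def __init__(self, num_channels: int) -> None:
        self.total = [0.0] * num_channels
        self.tokens = 0

    def update(self, batch: List[List[float]]) -> None:
        # `batch` is a list of token vectors, one float per input channel.
        for token in batch:
            for c, value in enumerate(token):
                self.total[c] += abs(value)
        self.tokens += len(batch)

    def mean(self) -> List[float]:
        return [t / self.tokens for t in self.total]


def merge(per_task: List[Dict[str, ChannelAbsMean]]) -> Dict[str, List[float]]:
    """Token-weighted merge (assumed); a layer seen by one task keeps that task's stats."""
    merged: Dict[str, ChannelAbsMean] = {}
    for task_stats in per_task:
        for name, stat in task_stats.items():
            acc = merged.setdefault(name, ChannelAbsMean(len(stat.total)))
            acc.total = [a + b for a, b in zip(acc.total, stat.total)]
            acc.tokens += stat.tokens
    return {name: acc.mean() for name, acc in merged.items()}


# Hypothetical layer names: x2t only reaches the understanding path,
# t2i also reaches the `_moe_gen` sibling.
x2t = {"layers.0.mlp.up_proj": ChannelAbsMean(2)}
x2t["layers.0.mlp.up_proj"].update([[1.0, -2.0], [3.0, 0.0]])
t2i = {"layers.0.mlp.up_proj_moe_gen": ChannelAbsMean(2)}
t2i["layers.0.mlp.up_proj_moe_gen"].update([[-4.0, 4.0]])

stats = merge([x2t, t2i])
print(sorted(stats))  # both layers now have statistics
```

In a real PyTorch run the `update` call would sit inside a forward hook on each target Linear layer.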

Key Insight: Calibrating on x2t alone leaves the _moe_gen weights without activation data, so AWQ falls back to plain min-max quantization for them. This is the root cause of the "gibberish" outputs. Adding a t2i (text-to-image) task routes activations through the generation path, so AWQ can compute appropriate scaling factors for these layers.
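The difference between the two cases can be sketched as follows. Given per-channel activation magnitudes, an AWQ-style search tries scales of the form s = a^alpha and keeps the alpha with the lowest error; with no statistics, every scale stays at 1 and the layer gets plain min-max quantization. The function names, the activation-weighted proxy loss, and the 21-point grid are assumptions for illustration, not the contents of awq_apply.py.

```python
from typing import List, Optional, Tuple

Matrix = List[List[float]]  # rows = output channels, columns = input channels


def fake_quant_row(row: List[float], group_size: int, bits: int = 4) -> List[float]:
    """Asymmetric min-max quantize-dequantize with one scale/offset per group."""
    qmax = (1 << bits) - 1
    out: List[float] = []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        lo, hi = min(group), max(group)
        step = (hi - lo) / qmax or 1.0  # constant group: any step works
        out.extend(lo + round((w - lo) / step) * step for w in group)
    return out


def weighted_error(weight: Matrix, act_mean: List[float],
                   scales: List[float], group_size: int) -> float:
    """Proxy loss (assumed): sum over i, j of (a_j * (W_ij - Q(W_ij * s_j) / s_j))^2."""
    loss = 0.0
    for row in weight:
        deq = fake_quant_row([w * s for w, s in zip(row, scales)], group_size)
        loss += sum((a * (w - d / s)) ** 2
                    for w, d, s, a in zip(row, deq, scales, act_mean))
    return loss


def search_scales(weight: Matrix, act_mean: Optional[List[float]],
                  group_size: int, grid: int = 20) -> Tuple[Optional[float], List[float]]:
    """Returns (alpha, scales); without statistics, falls back to no scaling."""
    n = len(weight[0])
    if act_mean is None:
        return None, [1.0] * n  # plain min-max quantization
    best = (float("inf"), 0.0, [1.0] * n)
    for step in range(grid + 1):
        alpha = step / grid
        scales = [max(a, 1e-4) ** alpha for a in act_mean]
        loss = weighted_error(weight, act_mean, scales, group_size)
        if loss < best[0]:
            best = (loss, alpha, scales)
    return best[1], best[2]


weight = [[0.02, -0.50, 0.03, 0.40], [0.01, 0.45, -0.02, -0.35]]
act_mean = [8.0, 0.1, 6.0, 0.2]  # channels 0 and 2 carry large activations
alpha, scales = search_scales(weight, act_mean, group_size=4)
print(alpha, scales)
print(search_scales(weight, None, group_size=4))  # the uncalibrated fallback
```

Because alpha = 0 (all scales equal to 1) is part of the grid, the searched result can never be worse than the fallback under this proxy loss.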

Section 04

Quantization Application: Grid Search & Grouping Strategy

| Script | Output Format | Description |
| --- | --- | --- |
| awq_apply.py | INT4 | Grid-searches the AWQ scaling balance for each norm + consumer Linear pair, fuses the scaling factors into the preceding RMSNorm, and packs the weights into INT4 by group. |
| nvfp4_apply.py | NVFP4 | Uses the same calibration data but packs into NVFP4 (E2M1 values with one FP8 E4M3 scale per 16-element block), targeting Blackwell tensor cores. |
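The NVFP4 layout in the table can be illustrated with a small pure-Python block quantizer. E2M1 has one sign bit, two exponent bits, and one mantissa bit, giving the eight magnitudes listed below. This sketch keeps the block scale as an ordinary Python float, whereas the real format stores it in FP8 E4M3, so the rounding of the scale itself is not modeled. The function names are hypothetical and this is not the code in nvfp4_apply.py.

```python
from typing import List, Tuple

# Magnitudes representable in E2M1, in bit-pattern order 0b000 .. 0b111.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # elements sharing one scale


def quantize_block(block: List[float]) -> Tuple[List[int], float]:
    """Returns 4-bit codes (sign bit << 3 | magnitude index) and the block scale."""
    assert len(block) == BLOCK
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto 6.0, the top of the grid.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        index = min(range(8), key=lambda i: abs(E2M1_GRID[i] - target))
        codes.append((8 if x < 0 else 0) | index)
    return codes, scale


def dequantize_block(codes: List[int], scale: float) -> List[float]:
    return [(-1.0 if c & 8 else 1.0) * E2M1_GRID[c & 7] * scale for c in codes]


block = [6.0, -3.0, 1.5, 0.5, 0.0, 4.0, -2.0, 1.0] * 2
codes, scale = quantize_block(block)
print(codes[:4], scale)  # [7, 13, 3, 1] 1.0
```

Values that already sit on the grid survive the round trip exactly; anything else snaps to the nearest grid point, and the grid is denser near zero than near the block maximum.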

Section 05

Runtime Replacement & Memory Optimization

| Script/Module | Function |
| --- | --- |
| run_baseline.py | bf16 baseline inference with a memory-optimized loader (meta initialization + streaming bf16 conversion), which lets the 12.3 GB bf16 model run on a 16 GB GPU. |
| run_quant_eval.py | Replaces Linear layers with WQLinearINT4/WQLinearNVFP4 and runs a comparative evaluation. |
| quantized_linear.py | A pure PyTorch reference module that dequantizes on demand, used for correctness verification. |
| comfyui/ | ComfyUI custom node package that automatically detects the Lance source. |
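On-demand dequantization of packed INT4 weights can be sketched in a few lines of pure Python. The packing order (two values per byte, low nibble first) and the (q - zero) * scale formula are common conventions assumed here for illustration; the actual layout used by WQLinearINT4 in quantized_linear.py is not described in the post, and the function names are hypothetical.

```python
from typing import List


def pack_int4(values: List[int]) -> bytes:
    """Packs unsigned 4-bit values (0..15) two per byte, low nibble first."""
    assert len(values) % 2 == 0 and all(0 <= v <= 15 for v in values)
    return bytes(lo | (hi << 4) for lo, hi in zip(values[0::2], values[1::2]))


def unpack_int4(packed: bytes) -> List[int]:
    out: List[int] = []
    for byte in packed:
        out.extend((byte & 0x0F, byte >> 4))
    return out


def dequantize(packed: bytes, scales: List[float], zeros: List[int],
               group_size: int) -> List[float]:
    """w = (q - zero) * scale, with one (scale, zero) pair per group."""
    q = unpack_int4(packed)
    return [(v - zeros[i // group_size]) * scales[i // group_size]
            for i, v in enumerate(q)]


q = [0, 15, 8, 7]
packed = pack_int4(q)
weights = dequantize(packed, scales=[0.5, 0.25], zeros=[8, 8], group_size=2)
print(packed.hex(), weights)  # f078 [-4.0, 3.5, 0.0, -0.25]
```

A reference module built this way trades speed for transparency: the weights are expanded to floating point just before the matrix multiply, so the result can be compared directly against the bf16 baseline.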

Section 06

Full Multimodal Version (Recommended for Production)

Retains Lance's MoE routing and supports both image/video generation and understanding:

| Variant | Original Size | Quantized Size | Compression Ratio |
| --- | --- | --- | --- |
| Lance-3B-AWQ-INT4 | 24.7 GB | 4.31 GB | 5.7x |
| Lance-3B-Video-AWQ-INT4 | 28.4 GB | 6.15 GB | 4.6x |
| Lance-3B-NVFP4 (Blackwell) | 24.7 GB | 5.09 GB | 4.9x |
| Lance-3B-Video-NVFP4 | 28.4 GB | 6.93 GB | 4.1x |
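One reason the NVFP4 files come out larger than their INT4 counterparts is per-group metadata. The arithmetic below estimates the average bits stored per quantized weight. The INT4 figures assume an fp16 scale and a 4-bit zero point per group, which is a common AWQ layout but is not confirmed by the post; the NVFP4 figure uses the 8-bit scale per 16-element block described earlier. Unquantized tensors are ignored, so these numbers will not reproduce the table exactly.

```python
def bits_per_weight(weight_bits: int, group_size: int,
                    scale_bits: int, zero_bits: int = 0) -> float:
    """Average storage per weight once per-group metadata is amortized."""
    return weight_bits + (scale_bits + zero_bits) / group_size


int4_g128 = bits_per_weight(4, 128, scale_bits=16, zero_bits=4)  # assumed layout
int4_g64 = bits_per_weight(4, 64, scale_bits=16, zero_bits=4)    # assumed layout
nvfp4 = bits_per_weight(4, 16, scale_bits=8)

print(int4_g128, int4_g64, nvfp4)  # 4.15625 4.3125 4.5
```

Under these assumptions NVFP4 costs more bits per weight than either INT4 setting, which points in the same direction as the table, although it does not have to account for the whole size difference.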

Section 07

Apple Silicon Special Version (Understanding Path Only)

Extracts only the understanding path, which is a standard Qwen2 architecture, for deployment on Apple Silicon and iOS:

| Variant | Size | Description |
| --- | --- | --- |
| Lance-3B-und-MLX-4bit-DWQ | 1.6 GB | Recommended (distilled scaling) |
| Lance-3B-und-MLX-4bit | 1.6 GB | Pure post-training quantization |
| Lance-3B-und-MLX-NVFP4 | 1.6 GB | Future ANE acceleration |
| Lance-3B-und-CoreML-palettized | 6.2 GB | fp16 iOS/ANE pipeline |

Section 08

v2 Improvements: group_size=64 Fixes Long Text Drift

v1 used group_size=128 and reached only 33% exact match (2 of 6) on the 6-sample x2t image benchmark. One typical failure shows the classic AWQ long-text degradation: asked about "1998 promotion campaign costs", the model inserted a fictional entity ("Scott Levin and his family") into its answer.

v2 re-quantization uses group_size=64:

  • Same calibration data, same recipe, only finer granularity
  • Exact match rises from 33% to 50% (3 of 6)
  • Case 4 matches the baseline exactly: "According to market research data, total spending on promotion meetings and activities in 1998 was approximately 1.3 billion US dollars"

Fix Principle: o_proj and down_proj sit after a nonlinearity rather than directly after a norm, so their AWQ scaling cannot be fused into a preceding norm and they rely on plain per-group quantization. With smaller groups, fewer outliers share each scale, which lowers the quantization noise on the remaining channels.
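The effect of group size can be reproduced on a toy weight vector with a single outlier. This sketch uses symmetric 4-bit quantization with one scale per group, which is an assumption (the post does not state whether the toolkit is symmetric or asymmetric), and the data is synthetic rather than taken from Lance.

```python
from typing import List


def fake_quant(weights: List[float], group_size: int, qmax: int = 7) -> List[float]:
    """Symmetric 4-bit quantize-dequantize with one scale per group."""
    out: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0
        out.extend(max(-qmax - 1, min(qmax, round(w / scale))) * scale
                   for w in group)
    return out


def mse(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# 128 small weights in [-0.03, 0.03] plus a single outlier at index 100.
weights = [0.01 * ((i % 7) - 3) for i in range(128)]
weights[100] = 1.0

err_128 = mse(weights, fake_quant(weights, 128))
err_64 = mse(weights, fake_quant(weights, 64))
print(err_64 < err_128)  # True
```

With one group of 128, the outlier sets the scale for every weight and all the small values round to zero. With groups of 64, only the half that contains the outlier pays that price, while the other half gets a scale matched to its own range.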