Lance uses an unusual architecture. Built on a modified Qwen2.5-VL, it adds parallel _moe_gen expert modules to each Transformer layer, implementing a "Mixture-of-Tasks" routing mechanism: understanding tokens flow through one expert, while generation tokens flow through another.
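
The routing described above can be illustrated with a toy example. This is a minimal sketch, not Lance's actual code: the class `DualExpertLayer`, the attribute names `mlp` and `mlp_moe_gen`, and the hard per-token `is_gen` flag are all assumptions made for illustration, and plain Python lists stand in for tensors.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class DualExpertLayer:
    """Toy stand-in for one Transformer sub-layer with a parallel `_moe_gen` copy.

    Hypothetical names: the real Lance modules may be organised differently.
    """

    mlp: Callable[[Vector], Vector]          # understanding-path expert
    mlp_moe_gen: Callable[[Vector], Vector]  # generation-path expert

    def forward(self, tokens: Sequence[Vector], is_gen: Sequence[bool]) -> List[Vector]:
        if len(tokens) != len(is_gen):
            raise ValueError("need exactly one task flag per token")
        # Assumption: routing is a hard switch on the task flag,
        # so each token visits exactly one of the two experts.
        return [
            self.mlp_moe_gen(tok) if gen else self.mlp(tok)
            for tok, gen in zip(tokens, is_gen)
        ]


# Two distinguishable toy experts: one doubles, the other adds one.
layer = DualExpertLayer(
    mlp=lambda v: [2.0 * x for x in v],
    mlp_moe_gen=lambda v: [x + 1.0 for x in v],
)

# First token is an understanding token, second is a generation token.
out = layer.forward([[1.0, 2.0], [3.0, 4.0]], is_gen=[False, True])
print(out)
```

The point of the sketch is that the two experts hold separate weights and see disjoint sets of tokens, which is what makes calibration data selection matter for quantization.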
This architecture poses quantization challenges:
- Architectural Specificity: Standard quantization tools like AWQ and AutoAWQ cannot recognize Lance's custom PreTrainedModel architecture.
- Routing Complexity: Simple x2t (image-to-text) calibration misses the _moe_gen weights, because understanding-only inputs never route through the generation experts. This leads to severe quality degradation in the generation path after quantization.
- Runtime Compatibility: Inference engines like vLLM and TensorRT-LLM do not yet support the Lance architecture.
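
The routing-complexity point can be made concrete by counting which experts a calibration set actually reaches. The following is a hedged sketch only: the module names `layers.0.mlp` and `layers.0.mlp_moe_gen` and the helpers `calibration_coverage` and `uncalibrated` are hypothetical, and each sample is reduced to a list of per-token task flags rather than real model inputs.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Hypothetical module names; the real checkpoint keys may differ.
UND_MODULE = "layers.0.mlp"
GEN_MODULE = "layers.0.mlp_moe_gen"

Sample = List[bool]  # one flag per token: True marks a generation token


def calibration_coverage(samples: Iterable[Sample]) -> Dict[str, int]:
    """Count how many calibration tokens would reach each expert."""
    seen: Counter = Counter({UND_MODULE: 0, GEN_MODULE: 0})
    for flags in samples:
        for is_gen in flags:
            seen[GEN_MODULE if is_gen else UND_MODULE] += 1
    return dict(seen)


def uncalibrated(coverage: Dict[str, int]) -> List[str]:
    """Modules that saw no tokens and therefore have no activation statistics."""
    return sorted(name for name, count in coverage.items() if count == 0)


# An x2t-only set contains understanding tokens exclusively.
x2t_only: List[Sample] = [[False] * 4, [False] * 3]
# A mixed set adds one sample that also carries generation tokens.
mixed: List[Sample] = x2t_only + [[False, False, True, True, True]]

print(uncalibrated(calibration_coverage(x2t_only)))
print(uncalibrated(calibration_coverage(mixed)))
```

Under these assumptions, the x2t-only set leaves the generation expert with no observed activations, while a set that includes generation tokens covers both experts.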
lance-quant addresses all three issues with hand-written calibration, packaging, and runtime replacement code.