Zing Forum

VIA-SD: A New Paradigm for Speculative Decoding with Hierarchical Verification via In-Model Routing

VIA-SD proposes a three-level speculative decoding framework that uses in-model routing to send medium-confidence tokens to lightweight sub-models for verification. It delivers an additional 10-20% speedup over strong speculative decoding baselines while maintaining output quality, and a 2.5-3x speedup over non-speculative decoding.

Tags: speculative decoding, LLM inference, model routing, efficiency, verification
Published 2026-06-10 23:45 | Recent activity 2026-06-11 11:48 | Estimated read 6 min

Section 01

VIA-SD: Introduction to the New Paradigm of Hierarchical Verification Speculative Decoding

Key Information about VIA-SD

  • Source: arXiv (published on June 10, 2026), original paper link: http://arxiv.org/abs/2606.12243v1
  • Author Team: See the paper for the author list; project homepage: https://zju-xyc.github.io/VIA-SD-Project-Page/
  • Core Innovation: A three-level speculative decoding framework that uses in-model routing to send medium-confidence tokens to lightweight sub-models for verification
  • Performance: An additional 10-20% speedup over strong speculative decoding baselines while maintaining output quality, and a 2.5-3x speedup over non-speculative decoding

This approach moves past the binary accept-or-reject decision of traditional speculative decoding and offers a new way to accelerate large model inference.

Section 02

Background: The Binary Decision Dilemma in Large Model Inference Acceleration

As LLM parameter counts grow, inference cost becomes a deployment bottleneck. Speculative Decoding (SD) improves throughput by having a draft model generate candidate tokens and a verification model check them in parallel, but traditional SD makes a binary decision for each candidate:

  • A candidate token is either fully accepted, or rejected and recomputed
  • Many medium-confidence tokens are rejected and fall back to the full large model, which wastes computing resources

This all-or-nothing strategy limits the efficiency gains of SD.
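
The binary decision described above can be sketched in a few lines. This is a minimal illustration of greedy draft-then-verify, not code from the paper: `binary_verify` and the `target_argmax` callback are hypothetical names, and the full model is called position by position for readability, whereas a real system scores all draft positions in one parallel pass.

```python
from typing import Callable, List, Sequence


def binary_verify(
    draft_tokens: Sequence[int],
    target_argmax: Callable[[List[int]], int],
    prefix: List[int],
) -> List[int]:
    """Greedy binary verification: keep draft tokens until the first mismatch.

    `target_argmax` is a hypothetical stand-in for the full verification
    model; it returns the model's most likely next token for a context.
    """
    accepted: List[int] = []
    context = list(prefix)
    for tok in draft_tokens:
        expected = target_argmax(context)  # full-model prediction here
        if tok != expected:
            # Binary reject: drop this and every later draft token and
            # emit the full model's own token instead.
            accepted.append(expected)
            return accepted
        # Binary accept: the draft token is kept unchanged.
        accepted.append(tok)
        context.append(tok)
    return accepted
```

With a toy target model that always predicts "last token plus one", a draft of `[2, 3, 9, 10]` after the prefix `[1]` yields `[2, 3, 4]`: the two matching tokens are kept, and everything from the first mismatch onward is discarded, regardless of how close the draft was.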

Section 03

VIA-SD's Three-Level Architecture and In-Model Routing Technology

Three-Level Verification Architecture

  1. High-confidence tokens: Accepted directly, with no additional verification
  2. Medium-confidence tokens: Verified by lightweight verifiers (slim-verifiers) derived from the main model
  3. Low-confidence tokens: Verified by the full verification model
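
The three levels above amount to a routing rule over per-token confidence. The sketch below is only an illustration under stated assumptions: the threshold values `HIGH_THRESHOLD` and `LOW_THRESHOLD`, the `Route` enum, and the use of a single confidence score per token are all hypothetical, since this summary does not say how VIA-SD defines or calibrates its confidence bands.

```python
from enum import Enum
from typing import List, Sequence


class Route(Enum):
    ACCEPT = "accept directly"
    SLIM = "slim-verifier"
    FULL = "full verifier"


# Hypothetical thresholds, chosen for illustration only.
HIGH_THRESHOLD = 0.9
LOW_THRESHOLD = 0.5


def route_token(
    confidence: float,
    high: float = HIGH_THRESHOLD,
    low: float = LOW_THRESHOLD,
) -> Route:
    """Map one token's confidence score to a verification level."""
    if not 0.0 <= low <= high <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= low <= high <= 1")
    if confidence >= high:
        return Route.ACCEPT  # level 1: no extra verification
    if confidence >= low:
        return Route.SLIM    # level 2: lightweight verifier
    return Route.FULL        # level 3: full verification model


def plan_verification(confidences: Sequence[float]) -> List[Route]:
    """Route every candidate token in a draft independently."""
    return [route_token(c) for c in confidences]
```

For example, `plan_verification([0.95, 0.7, 0.2])` returns one route per token: direct acceptance, slim-verifier, and full verifier. Only the last token pays the cost of the full model.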

Advantages of In-Model Routing Design

  • Lightweight verifiers share parameters with the main model, so they add no storage overhead
  • They inherit the main model's knowledge, avoiding the knowledge gaps of independently trained small models
  • They integrate with existing SD frameworks without changes to training processes or architectures

This design allocates verification compute at a finer granularity than a single accept-or-reject check.
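
To make the "shared parameters, no extra storage" point concrete, here is a toy sketch. It assumes, purely for illustration, that a slim-verifier runs a subset of the main model's layers; this summary does not state how VIA-SD actually derives its slim-verifiers, and `MainModel`, `SlimVerifier`, and `slim_view` are hypothetical names. Layers are modeled as plain functions on a float.

```python
from typing import Callable, List

Layer = Callable[[float], float]


class SlimVerifier:
    """Runs a subset of the main model's layers (illustrative assumption)."""

    def __init__(self, layers: List[Layer]) -> None:
        self.layers = layers  # references to the main model's layers, not copies

    def forward(self, x: float) -> float:
        for layer in self.layers:
            x = layer(x)
        return x


class MainModel:
    def __init__(self, layers: List[Layer]) -> None:
        self.layers = layers

    def forward(self, x: float) -> float:
        for layer in self.layers:
            x = layer(x)
        return x

    def slim_view(self, keep: List[int]) -> SlimVerifier:
        # Select layers by index; the same objects are reused, so the
        # slim-verifier stores no parameters of its own.
        return SlimVerifier([self.layers[i] for i in keep])


model = MainModel([lambda x: x + 1, lambda x: x * 2, lambda x: x - 3])
slim = model.slim_view(keep=[0, 2])
```

Here `slim.layers[0] is model.layers[0]` holds, which is the sense in which the verifier adds no storage: it is a cheaper path through weights that already exist.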

Section 04

Experimental Results: Measured Performance Gains

Experimental results on four representative tasks:

  • Reduced Rejection Rate: The token rejection rate decreases by 0.10-0.22, so more candidate tokens are put to use
  • Relative Acceleration: An additional 10-20% speedup over strong baseline SD methods
  • Absolute Acceleration: A 2.5-3x inference speedup over non-speculative decoding

These results support the practical benefit of the three-level strategy.
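
To see why a lower rejection rate matters, a simplified model from the standard speculative decoding analysis helps: if each draft token is accepted independently with probability a and the draft length is g, the expected number of tokens produced per verification step is (1 - a^(g+1)) / (1 - a). The sketch below applies that formula. The baseline rejection rate of 0.40 and the draft length of 4 are hypothetical inputs, not figures from the paper, and the model ignores the differing cost of slim and full verification.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens per verification step under i.i.d. acceptance.

    Uses (1 - a**(g + 1)) / (1 - a), the closed form from the standard
    speculative decoding analysis; a is the per-token acceptance rate
    and g is the number of draft tokens per step.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance_rate == 1.0:
        # Every draft token is accepted, plus one token from the verifier.
        return float(draft_len + 1)
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


if __name__ == "__main__":
    baseline_rejection = 0.40  # hypothetical baseline, for illustration
    draft_len = 4              # hypothetical draft length
    for drop in (0.0, 0.10, 0.22):
        acceptance = 1.0 - (baseline_rejection - drop)
        tokens = expected_tokens_per_step(acceptance, draft_len)
        print(f"rejection drop {drop:.2f}: {tokens:.2f} tokens per step")
```

Under these assumed inputs, the expected yield per step rises from roughly 2.31 tokens to roughly 2.77 and 3.50 tokens for rejection-rate drops of 0.10 and 0.22. The absolute numbers depend entirely on the assumed baseline, so this shows the direction of the effect rather than reproducing the reported speedups.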

Section 05

Compatibility Advantages and Technical Significance

Compatibility

VIA-SD can be applied directly to already trained SD systems without retraining the draft or verification models, so engineers can deploy it quickly and still gain the speedup.

Technical Significance

VIA-SD marks a shift in speculative decoding from "binary decision" to "multi-level refined verification", and suggests that further inference acceleration depends on allocating computing resources intelligently during the verification phase.

Section 06

Insights and Future Directions

VIA-SD's approach suggests directions for large model inference optimization:

  • Future work can explore schemes that combine confidence stratification with dynamic resource scheduling
  • Such schemes could support efficient deployment of large models on edge devices, in real-time interaction, and in similar scenarios

Core insight: Efficiency gains come not from adding computation, but from allocating existing computing resources more intelligently.