# VIA-SD: A New Paradigm for Speculative Decoding with Hierarchical Verification via In-Model Routing

> VIA-SD proposes a three-level speculative decoding framework that uses in-model routing to send medium-confidence tokens to lightweight sub-models for verification. It reports a 10-20% speedup over strong speculative decoding baselines while maintaining output quality, and 2.5-3x acceleration over non-speculative decoding.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-10T15:45:18.000Z
- Last activity: 2026-06-11T03:48:37.969Z
- Popularity: 123.9
- Keywords: speculative decoding, LLM inference, model routing, efficiency, verification
- Page link: https://www.zingnex.cn/en/forum/thread/via-sd
- Canonical: https://www.zingnex.cn/forum/thread/via-sd

---

## VIA-SD: Introducing Hierarchical Verification for Speculative Decoding

### Key Information about VIA-SD
- **Source**: arXiv (published June 10, 2026); original paper: http://arxiv.org/abs/2606.12243v1
- **Author Team**: the paper's authors; project homepage: https://zju-xyc.github.io/VIA-SD-Project-Page/
- **Core Innovation**: a three-level speculative decoding framework in which in-model routing sends medium-confidence tokens to lightweight sub-models for verification
- **Performance**: a 10-20% speedup over strong speculative decoding baselines with output quality maintained, and 2.5-3x acceleration over non-speculative decoding

VIA-SD moves beyond the binary accept-or-reject decision of traditional speculative decoding and offers a new approach to accelerating large-model inference.

## Background: The Binary Decision Dilemma in Large Model Inference Acceleration

As LLM parameter counts grow, inference cost becomes a deployment bottleneck. Speculative Decoding (SD) improves throughput by having a draft model propose candidate tokens, which a verification model then checks in parallel. Traditional SD, however, makes a binary decision for each candidate:
- A candidate token is either accepted outright or rejected and recomputed
- Many medium-confidence tokens are rejected and fall back to the full large model, which wastes computing resources

This "one-size-fits-all" strategy limits the efficiency gains SD can deliver.
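The binary rule can be made concrete with a minimal sketch. It assumes the widely used speculative sampling acceptance test, in which a drafted token is kept with probability `min(1, p / q)`, where `q` is the draft model's probability for that token and `p` is the verification model's. The function name `count_accepted` and its arguments are hypothetical, and the baseline SD methods evaluated in the paper may differ in detail.

```python
import random


def count_accepted(p_draft, p_target, rng=random.random):
    """Return how many leading draft tokens survive binary verification.

    p_draft[i]  -- draft-model probability of the i-th drafted token (q)
    p_target[i] -- verification-model probability of the same token (p)
    rng         -- callable returning a float in [0, 1); injectable for tests
    """
    if len(p_draft) != len(p_target):
        raise ValueError("p_draft and p_target must have the same length")

    accepted = 0
    for q, p in zip(p_draft, p_target):
        if q <= 0.0:
            raise ValueError("draft probabilities must be positive")
        # Binary decision: keep the token with probability min(1, p / q).
        if rng() < min(1.0, p / q):
            accepted += 1
        else:
            # The first rejection discards this token and every token after it.
            break
    return accepted
```

Note that there is no middle outcome: a token whose ratio `p / q` is only slightly below 1 is treated the same as one the verification model considers very unlikely once the random draw goes against it.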

## VIA-SD's Three-Level Architecture and In-Model Routing Technology

### Three-Level Verification Architecture
1. **High-confidence tokens**: Accepted directly, with no additional verification
2. **Medium-confidence tokens**: Routed to lightweight verifiers (slim-verifiers) derived from the main model
3. **Low-confidence tokens**: Verified by the full verification model
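The three tiers above can be sketched as a simple threshold router. This is an illustration only: the thresholds `high=0.9` and `low=0.5`, the `Tier` names, and the idea of a single scalar confidence per token are assumptions, since this summary does not say how VIA-SD measures confidence or where it places the boundaries.

```python
from enum import Enum


class Tier(Enum):
    ACCEPT = "accept"        # high confidence: no extra verification
    SLIM = "slim_verifier"   # medium confidence: lightweight verifier
    FULL = "full_verifier"   # low confidence: full verification model


def route(confidence, high=0.9, low=0.5):
    """Map one token's confidence score to a verification tier.

    The thresholds are hypothetical placeholders, not values from the paper.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if low > high:
        raise ValueError("low threshold must not exceed high threshold")
    if confidence >= high:
        return Tier.ACCEPT
    if confidence >= low:
        return Tier.SLIM
    return Tier.FULL


def route_block(confidences, high=0.9, low=0.5):
    """Route every drafted token in a block independently."""
    return [route(c, high, low) for c in confidences]
```

For example, `route_block([0.95, 0.70, 0.20])` sends the first token straight through, the second to the slim verifier, and the third to the full model.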

### Advantages of In-Model Routing Design
- Lightweight verifiers share parameters with the main model, so they add no storage overhead
- They inherit the main model's knowledge, avoiding the knowledge gaps of independent small models
- They integrate with existing SD frameworks without changes to training processes or architectures

This design allocates verification compute at a finer granularity, matching verifier cost to token confidence.
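One way to picture parameter sharing is a slim verifier that runs only a subset of the main model's layers and holds references to them rather than copies. The toy sketch below assumes exactly that; the layer-subset construction, the class names, and the scalar "layers" are all hypothetical, since this summary does not describe how VIA-SD actually derives its slim-verifiers.

```python
class Layer:
    """Toy stand-in for a transformer layer with one scalar parameter."""

    def __init__(self, weight):
        self.weight = weight

    def __call__(self, x):
        return x * self.weight + 1.0


class MainModel:
    """Full verifier: applies every layer in order."""

    def __init__(self, weights):
        self.layers = [Layer(w) for w in weights]

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


class SlimVerifier:
    """Cheaper verifier built from a subset of the main model's layers.

    It stores references to the existing Layer objects, so no parameters
    are duplicated (an assumed construction, not the paper's).
    """

    def __init__(self, main_model, keep):
        self.layers = [main_model.layers[i] for i in keep]

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


main = MainModel([1.0, 2.0, 3.0, 4.0])
slim = SlimVerifier(main, keep=[0, 3])  # skips the two middle layers
```

Because `slim.layers[0]` is the same object as `main.layers[0]`, the slim verifier costs no extra storage and automatically reflects the main model's weights.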

## Experimental Verification: Significant Performance Improvement Data

Reported results on four representative tasks:
- **Reduced Rejection Rate**: The token rejection rate drops by 0.10-0.22, so more candidate tokens are put to use
- **Relative Acceleration**: An additional 10-20% speedup over strong baseline SD methods
- **Absolute Acceleration**: 2.5-3x inference acceleration over non-speculative decoding

These results support the practical performance gains of the three-level strategy.
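To see why a lower rejection rate translates into speed, a rough back-of-the-envelope model helps. Under the standard simplifying assumption that each drafted token is accepted independently with probability `a`, a draft of length `k` yields on average `(1 - a**(k+1)) / (1 - a)` tokens per verification step. The sketch below applies that formula; the baseline rejection rate of 0.40 and the draft length of 4 are hypothetical, only the 0.10 and 0.22 reductions come from the reported results, and the paper's own rejection-rate metric may be defined differently.

```python
def expected_tokens_per_step(rejection_rate, draft_len):
    """Expected tokens produced per verification step.

    Assumes each drafted token is accepted independently with probability
    a = 1 - rejection_rate; the step always yields at least one token.
    """
    if not 0.0 <= rejection_rate <= 1.0:
        raise ValueError("rejection_rate must lie in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    a = 1.0 - rejection_rate
    if a >= 1.0:
        return float(draft_len + 1)  # every drafted token is accepted
    return (1.0 - a ** (draft_len + 1)) / (1.0 - a)


if __name__ == "__main__":
    baseline = 0.40   # hypothetical baseline rejection rate
    draft_len = 4     # hypothetical draft length
    for reduction in (0.0, 0.10, 0.22):  # 0.10 and 0.22 are the reported bounds
        rate = baseline - reduction
        tokens = expected_tokens_per_step(rate, draft_len)
        print(f"rejection rate {rate:.2f}: {tokens:.2f} tokens per step")
```

This model ignores the cost of the slim verifier itself, so it indicates the direction of the effect rather than the size of the end-to-end speedup.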

## Compatibility Advantages and Technical Significance

### Compatibility
VIA-SD can be applied directly to already-trained SD systems without retraining the draft or verification models, so engineers can deploy it quickly and gain the performance improvement.

### Technical Significance
VIA-SD marks a shift in speculative decoding from "binary decision" to "multi-level refined verification", and suggests that inference acceleration benefits from allocating computing resources intelligently during the verification phase.

## Insights and Future Directions

VIA-SD's approach offers pointers for large-model inference optimization:
- Future work could explore schemes built on confidence stratification and dynamic resource scheduling
- Such schemes could support efficient deployment of large models on edge devices, in real-time interaction, and in similar scenarios

Core insight: Efficiency gains come not from adding computation but from allocating existing computing resources more intelligently.
