1. Hybrid Speculative Decoding
Qwen 3.5 uses a hybrid architecture of 24 GatedDeltaNet (GDN) layers + 8 full attention layers. Traditional speculative decoding faces a critical issue at the GDN layers: when a draft token is rejected, the KV cache can be rolled back, but each GDN layer's recurrent state and convolution buffer have already advanced through the entire draft window, corrupting the state. m5-infer's solution is to snapshot every GDN layer's (recurrent_state, conv_buf) into a pre-allocated tensor pool before each validation step. On rejection, it restores from the snapshot in constant time relative to the draft length, with zero allocation on the hot path. In practice, this brings a 35% throughput improvement (from 29 to 40 tok/s) on Qwen 3.5 9B, with an acceptance rate of about 70%.
2. Cross-Turn State Persistence (CTRSP)
After each generation round, m5-infer serializes the complete model state (quantized KV cache + GDN recurrent/convolution buffer) to disk, using the hash of the original bytes of the prompt prefix tokens as the key. Since the hash is based on token bytes rather than decoded text, the same system prompt and tool mode can hit the cache even with different user inputs attached. Effect: The warm-up TTFT for the 12K token tool mode is reduced from 11s to 2-3s, and the cache hit rate for typical agent workloads exceeds 90%.
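The key derivation described above might look like the sketch below. The function name, the SHA-256 choice, and the little-endian uint32 packing are all assumptions for illustration; the source only states that the key is a hash of the prefix tokens' raw bytes rather than of decoded text.

```python
import hashlib
import struct

def prefix_cache_key(prefix_token_ids):
    """Derive a cache key from the raw bytes of the prompt-prefix token IDs.

    Hashing token IDs rather than decoded text means two requests that share
    the same system prompt + tool definitions map to the same key, no matter
    what user input follows the prefix. Names and encoding are illustrative.
    """
    raw = struct.pack(f"<{len(prefix_token_ids)}I", *prefix_token_ids)
    return hashlib.sha256(raw).hexdigest()
```

On a cache hit, the engine deserializes the stored quantized KV cache and GDN buffers for that key instead of re-prefilling the shared prefix.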
3. Thought-Aware Budgeting and Escape Prompts
Qwen 3.5's chain-of-thought is wrapped in <think>...</think> tags. Common failure modes include:
- Budget Starvation: Most engines count thought tokens towards the user's max_tokens, leading to truncation in the answer phase
- Thought Loop Trap: The model gets stuck in an infinite loop like "Wait, let me re-check..."
m5-infer's solutions:
- Separate thought budget (max_thinking_tokens, default 32K), where the user's max_tokens is only used for the answer phase
- Run a 6-gram repetition detector inside the thought block (threshold of 3 repetitions)
- When a loop is detected, inject a typed transition prompt (e.g., "Final JSON:") to force the model into the desired output format
Effect: Structured JSON extraction task score increased from 1.40 to 7.85 (+461%), and code generation from 3.10 to 6.55 (+111%).
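The 6-gram repetition check can be sketched as below. This is a naive rescan for clarity; a production detector would presumably update counts incrementally as thought tokens stream in. The function name and signature are assumptions.

```python
from collections import Counter

def detect_ngram_loop(token_ids, n=6, threshold=3):
    """Flag a thought loop when any n-gram repeats `threshold` or more times.

    Matches the 6-gram / 3-repetition setting described above. Operates on
    token IDs inside the thought block only.
    """
    if len(token_ids) < n:
        return False
    counts = Counter(tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1))
    return max(counts.values()) >= threshold
```

When this returns True mid-thought, the engine would close the thought block and append the typed transition prompt (e.g. "Final JSON:") to the context before continuing generation.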
4. Needle-Retrieval Heuristic
Qwen 3.5 has a safety-alignment quirk when thought mode is disabled: in long contexts (12K+ tokens) with short retrieval queries, it sometimes refuses to answer, claiming it "cannot disclose authoritative information"—even when the information comes from content the user themselves provided. m5-infer detects the long-context + short-query pattern at the routing layer and forces thought mode on, bypassing the refusal. In practice, the long-context retrieval success rate increased from 0/6 to 6/6.
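The routing check itself is a simple predicate. The 12K context threshold comes from the text; the short-query cutoff (64 tokens here) and the function name are assumptions for illustration.

```python
def should_force_thinking(context_tokens, query_tokens,
                          long_ctx_threshold=12_000, short_query_threshold=64):
    """Force thought mode for long-context + short-query retrieval requests.

    12K matches the context length cited above; the 64-token query cutoff is
    an assumed value for what counts as a "short" retrieval query.
    """
    return (len(context_tokens) >= long_ctx_threshold
            and len(query_tokens) <= short_query_threshold)
```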
5. Adaptive Layer Skipping (ALS)
For "simple" tokens, skip layers whose omission has minimal effect on the output, reducing per-token computation.
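One plausible realization, assuming token "simplicity" is judged by the entropy of the previous step's output distribution: run all layers for uncertain tokens, and for easy tokens keep the first and last layers (which typically matter most) while skipping alternating middle layers. Every threshold and the skip pattern here are assumptions; the source does not specify the criterion.

```python
def layers_to_run(num_layers, token_entropy, entropy_threshold=1.0):
    """Select layer indices to execute for the current token (illustrative).

    High entropy (a "hard" token) runs the full stack; low entropy keeps the
    first and last layers and every other middle layer.
    """
    if token_entropy >= entropy_threshold:
        return list(range(num_layers))
    return [0] + list(range(1, num_layers - 1, 2)) + [num_layers - 1]
```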
6. Self-Speculative Early Exit (SSEE)
A self-speculative mechanism in which the model exits the forward pass at an intermediate layer when prediction confidence is high, using the remaining layers only to verify the drafted token.
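The exit rule can be sketched as follows, assuming each candidate exit point has a logit head. Walking the exit heads in depth order, the first whose top probability clears a confidence threshold emits the draft token; the final layer always emits. The 0.9 threshold and all names are assumptions.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a plain list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def early_exit_token(layer_logits, confidence=0.9):
    """Return (exit_layer_index, token_id) at the first confident exit head.

    `layer_logits` lists the logits produced at each exit point, shallowest
    first; the last entry is the full model's head, which always fires.
    """
    for i, logits in enumerate(layer_logits):
        probs = softmax(logits)
        top = max(range(len(probs)), key=probs.__getitem__)
        if probs[top] >= confidence or i == len(layer_logits) - 1:
            return i, top
```

Tokens emitted from an intermediate exit are what later get verified by a full forward pass, which is what makes the scheme self-speculative rather than plain truncation.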
7. Parallel Expert Scheduling (PES)
Concurrently execute multiple expert paths in MoE (Mixture of Experts) models.
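A minimal sketch of the dispatch pattern, using a Python thread pool as a stand-in: the router's top-k experts for a token run concurrently and their outputs are mixed by the gate weights. A real engine would batch experts into fused GPU kernels rather than use threads; all names here are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def run_experts_parallel(experts, x, top_k_ids, weights):
    """Run a token's top-k experts concurrently and gate-mix their outputs.

    `experts` is a list of callables, `top_k_ids` the router's chosen expert
    indices, `weights` the corresponding gate values (assumed normalized).
    """
    with ThreadPoolExecutor(max_workers=len(top_k_ids)) as pool:
        outs = list(pool.map(lambda eid: experts[eid](x), top_k_ids))
    return sum(w * o for w, o in zip(weights, outs))
```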
8. X5-R Compiled Forward Propagation
Metal kernel fusion via MLX's mx.compile brings roughly a 40% throughput improvement (from 17 to 24 tok/s).