Zing Forum

MTP-D: Self-Distillation Boosts Multi-Token Prediction, Achieving 220% Inference Acceleration

MTP-D uses self-distillation to increase the acceptance rate of multi-token prediction heads by 7.5%, and its looped extension strategy achieves 220.4% inference acceleration compared to single-head MTP, providing new ideas for optimizing LLM inference efficiency.

Tags: Multi-token prediction · Self-distillation · Inference acceleration · Large language models · Inference efficiency
Published 2026-03-25 12:00 · Recent activity 2026-03-27 13:22 · Estimated read 2 min
Section 01

Introduction / Main Floor

MTP-D uses self-distillation to increase the acceptance rate of multi-token prediction heads by 7.5%, and its looped extension strategy achieves 220.4% inference acceleration compared to single-head MTP, providing new ideas for optimizing LLM inference efficiency.

Section 02

Background and Challenges

As the scale of large language models expands, inference efficiency has become a key bottleneck. Multi-token prediction (MTP) accelerates inference by predicting multiple future tokens in parallel, but it faces two major challenges:

  1. Limited acceptance rate of MTP heads: drafted tokens are often rejected by the main head, capping the achievable speedup
  2. Difficulty in jointly training multiple MTP heads without degrading the main model
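To make the acceptance-rate bottleneck concrete, here is a minimal draft-and-verify sketch of how MTP-style speculative decoding works. The `main_model` and the draft list are toy stand-ins, not MTP-D's actual API: each MTP head proposes one future token, and the main head verifies the proposals, so only the matching prefix is accepted.

```python
# Toy draft-and-verify loop: MTP heads propose several tokens ahead,
# and the main head accepts the longest prefix it agrees with.
# `main_model` is a hypothetical greedy next-token function (ids in, id out).

def accept_draft(main_model, draft, context):
    """Verify drafted tokens against the main model; return the accepted tokens.

    In practice all draft positions are verified in one parallel forward pass;
    we loop here for clarity. The first mismatch is replaced by the main
    head's own choice, so correctness matches plain autoregressive decoding."""
    accepted = []
    ctx = list(context)
    for tok in draft:
        target = main_model(ctx)      # main head's greedy choice at this step
        if tok != target:
            accepted.append(target)   # replace the first mismatch and stop
            break
        accepted.append(tok)          # draft token accepted "for free"
        ctx.append(tok)
    return accepted

# Toy "model" that continues an arithmetic pattern: the draft heads guess
# two of four tokens correctly, so three tokens emerge from one verify pass.
main_model = lambda ctx: ctx[-1] + 1
print(accept_draft(main_model, [4, 5, 9, 10], [1, 2, 3]))  # -> [4, 5, 6]
```

A higher acceptance rate means longer accepted prefixes per verification pass, which is exactly the quantity MTP-D's self-distillation targets.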

Section 03

MTP-D: Self-Distillation Solution

Core Innovation: A simple and efficient self-distillation method

  • Minimal additional training cost
  • 7.5% increase in MTP head acceptance rate
  • Maximally preserves main head performance
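The core idea above can be sketched as a standard knowledge-distillation loss in which the frozen main head teaches its own MTP heads. This is a minimal illustration assuming a temperature-softened KL objective; MTP-D's exact loss and hyperparameters are not specified here.

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable temperature-scaled softmax over a logit vector.
    z = [l / temperature for l in logits]
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def self_distill_loss(mtp_logits, main_logits, temperature=2.0):
    """KL(teacher || student): the main head's distribution teaches the MTP head.

    The teacher term would carry a stop-gradient in training, so only the
    MTP head moves and main-head performance is preserved. `temperature`
    softens both distributions, a common distillation trick (assumed here)."""
    teacher = softmax(main_logits, temperature)
    student = softmax(mtp_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(teacher, student))

# Identical logits give zero loss; diverging logits give a positive loss.
print(self_distill_loss([1.0, 2.0], [1.0, 2.0]))  # -> 0.0
```

Because the teacher is the model itself, the extra training cost is one additional forward target per position rather than a separate teacher model, which matches the "minimal additional training cost" claim.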

Section 04

Looped Extension Strategy

MTP-D further introduces a looped extension strategy:

  • Economically and efficiently expand MTP heads
  • Achieve 220.4% inference acceleration compared to single-head MTP
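One economical way to read "looped extension" is reusing a single draft head recurrently, feeding each guess back in to draft deeper positions instead of training one separate head per position. The sketch below illustrates that idea with hypothetical names; it is not MTP-D's actual implementation.

```python
# Hypothetical "looped" head reuse: one shared draft head is applied
# `depth` times, so draft length grows without adding new heads.

def looped_draft(head, context, depth):
    """Roll a single draft head `depth` steps to propose `depth` tokens."""
    draft, ctx = [], list(context)
    for _ in range(depth):
        tok = head(ctx)      # the same head is reused at every loop step
        draft.append(tok)
        ctx.append(tok)      # feed the guess back for the next position
    return draft

# Toy head that doubles the last token id, drafting three tokens ahead.
head = lambda ctx: ctx[-1] * 2
print(looped_draft(head, [1], 3))  # -> [2, 4, 8]
```

Longer drafts only pay off if acceptance stays high, which is why the self-distillation gain and the looped extension compound into the reported 220.4% acceleration over single-head MTP.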

Section 05

Experimental Validation

Systematic evaluation across seven benchmarks yields:

  • Key insights into distillation strategies
  • Scalability potential of MTP

Section 06

Practical Value

This work improves both the acceptance rate and the inference efficiency of MTP heads, advancing the practical adoption of MTP in LLM serving.