
Padding Token Overhead in LLM Inference: An Efficiency Study Under Tensor and Pipeline Parallel Configurations

A systematic benchmark study on the impact of padding tokens on computational efficiency during large language model (LLM) inference. Based on empirical data from the Qwen2.5-32B model on an A100 GPU cluster, it reveals performance differences and optimization directions under different parallel strategies.

Tags: LLM inference · padding tokens · tensor parallelism · pipeline parallelism · Qwen2.5 · A100 · batching optimization · inference efficiency
Published 2026-05-02 23:41 · Recent activity 2026-05-02 23:49 · Estimated read: 6 min

Section 01

[Introduction] Efficiency Study of Padding Token Overhead in LLM Inference Under Parallel Configurations

This study conducts a systematic benchmark test on the impact of padding tokens on computational efficiency during large language model (LLM) inference. Based on empirical data from the Qwen2.5-32B model on an NVIDIA A100 GPU cluster, it reveals performance differences between two distributed parallel strategies—Tensor Parallelism (TP) and Pipeline Parallelism (PP)—and proposes corresponding optimization directions, providing data-driven decision support for LLM inference system architecture design.


Section 02

Research Background: Padding Tokens—The Invisible Killer of Inference Efficiency

Batch processing is key to improving GPU utilization in LLM deployment, but inconsistent input sequence lengths force shorter sequences to be padded to a common length. These padding tokens carry no semantic content yet still consume memory and compute. Mainstream frameworks such as vLLM and TensorRT-LLM mitigate the issue with paged attention and dynamic batching, but systematic, quantitative measurements of padding overhead under different parallel configurations have been lacking; that gap is what this project addresses.
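To make the overhead concrete, here is a minimal sketch of how much of a naively padded batch is spent on padding; the sequence lengths are illustrative, not taken from the study's data:

```python
# Minimal sketch: fraction of a naively padded batch wasted on padding.
# The sequence lengths below are illustrative, not from the study's data.
lengths = [512, 384, 96, 1024, 768, 128, 256, 640]

max_len = max(lengths)                      # every row is padded to this
real_tokens = sum(lengths)
padded_tokens = max_len * len(lengths)      # tokens actually computed on
pad_ratio = 1 - real_tokens / padded_tokens

print(f"batch shape: {len(lengths)} x {max_len} = {padded_tokens} tokens")
print(f"real: {real_tokens}, padding: {padded_tokens - real_tokens}")
print(f"padding ratio: {pad_ratio:.1%}")    # ~53.5% wasted in this batch
```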


Section 03

Experimental Design: Model, Hardware, and Parallel Strategies

The project, open-sourced by the divide-by-zer0 team, benchmarks the Qwen2.5-32B model on an NVIDIA A100 GPU cluster. The experiment covers two parallel strategies:

  • Tensor Parallelism (TP): splits each layer's computation across multiple GPUs, suited to single-node multi-GPU deployments;
  • Pipeline Parallelism (PP): places contiguous groups of layers on different GPUs, suited to large-scale cross-node deployments.

Using a controlled-variable methodology, the team measures how padding tokens affect latency and throughput across combinations of batch size, sequence length, and parallelism degree; a sketch of such a sweep follows.
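The repository's actual harness is not reproduced in this summary; the following is a minimal sketch assuming a hypothetical run_inference wrapper around the serving engine, with illustrative grid values:

```python
import itertools

# Hypothetical grid mirroring the controlled-variable design: vary one
# axis at a time across batch size, sequence length, and parallelism.
BATCH_SIZES   = [1, 4, 16, 64]
SEQ_LENGTHS   = [256, 1024, 4096]              # padded batch length
PARALLEL_CFGS = [("tp", 2), ("tp", 4), ("pp", 2), ("pp", 4)]

def run_inference(batch_size, seq_len, mode, degree):
    """Hypothetical wrapper: serves Qwen2.5-32B under the given parallel
    config and returns (latency_seconds, generated_token_count)."""
    raise NotImplementedError  # stand-in for the actual serving call

results = []
for bs, sl, (mode, deg) in itertools.product(BATCH_SIZES, SEQ_LENGTHS, PARALLEL_CFGS):
    latency, tokens = run_inference(bs, sl, mode, deg)
    results.append({
        "batch": bs, "seq_len": sl, "parallel": f"{mode}{deg}",
        "latency_s": latency, "throughput_tok_per_s": tokens / latency,
    })
```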

Section 04

Key Findings: Overhead Differences of Padding Tokens Under Parallel Configurations

Key experimental insights:

  1. Under Tensor Parallelism, padding overhead scales non-linearly with parallelism degree: increasing the TP degree reduces per-GPU load, but cross-GPU communication is amplified by padding and can offset the gains, especially for batches with widely varying sequence lengths;
  2. Pipeline Parallelism handles variable-length sequences better: padding only affects computation within its own stage, incurs no cross-stage propagation overhead, and is more robust under dynamic batching;
  3. Padding's memory impact is underestimated: for a 32B model, a 10% padding ratio can waste several GB of memory, which is most pronounced in long-sequence workloads such as document understanding (see the back-of-envelope estimate after this list).
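The "several GB" figure is easy to sanity-check with a KV-cache calculation. Assuming Qwen2.5-32B's published architecture (64 layers, 8 KV heads under grouped-query attention, head dimension 128), an fp16 cache, and an illustrative batch shape:

```python
# Back-of-envelope KV-cache waste from padding. Config values assumed
# from Qwen2.5-32B's published architecture (64 layers, 8 KV heads
# under GQA, head dim 128), fp16 cache; batch shape is illustrative.
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 64, 8, 128, 2

kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES  # K and V
print(kv_bytes_per_token / 2**10, "KiB per token")             # 256 KiB

batch, seq_len, pad_ratio = 32, 4096, 0.10
wasted_tokens = batch * seq_len * pad_ratio
print(f"wasted KV cache: {wasted_tokens * kv_bytes_per_token / 2**30:.1f} GiB")
# -> ~3.2 GiB of cache holding nothing but padding
```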

Section 05

Engineering Implications: Practical Recommendations for Padding Token Optimization

Based on the findings, the following optimization recommendations are proposed:

  1. Intelligent Batching: dynamically choose batch composition based on the sequence-length distribution to avoid excessive padding;
  2. Hybrid Parallelism: use TP within nodes and PP across nodes to balance computation and communication overhead;
  3. Padding-Aware Scheduling: have the scheduling layer estimate padding cost and preferentially group requests of similar length (a minimal sketch follows this list);
  4. Sequence Packing: drawing on image-packing techniques, explore attention computation over non-contiguous memory.
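Recommendations 1 and 3 amount to length-aware batch formation. A minimal sketch, with an illustrative Request shape and thresholds (not the study's scheduler), could look like:

```python
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    length: int  # prompt length in tokens

def pack_batches(requests, max_batch=32, max_pad_ratio=0.15):
    """Sort by length, then greedily cut a new batch whenever adding the
    next request would push the padding ratio past the budget."""
    batches, current = [], []
    for req in sorted(requests, key=lambda r: r.length):
        candidate = current + [req]
        max_len = candidate[-1].length          # sorted: last is longest
        pad_ratio = 1 - sum(r.length for r in candidate) / (max_len * len(candidate))
        if len(candidate) > max_batch or (len(candidate) > 1 and pad_ratio > max_pad_ratio):
            batches.append(current)             # flush, start a new batch
            current = [req]
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches

# Similar lengths batch together; length outliers run alone.
reqs = [Request(i, n) for i, n in enumerate([96, 128, 512, 520, 1024, 4096])]
print([[r.length for r in b] for b in pack_batches(reqs)])
# -> [[96, 128], [512, 520], [1024], [4096]]
```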

Section 06

Industry Impact and Outlook: Future Directions for the Padding Overhead Problem

This study quantifies padding overhead, providing data support for designing LLM inference architectures. As Mixture of Experts (MoE) and multimodal models become more prevalent, the padding problem will grow more severe. Looking ahead, the authors anticipate progress at three levels:

  • Hardware level: sparse attention acceleration;
  • Algorithm level: dynamic sequence reorganization;
  • System level: request prediction and pre-padding.

Innovation in the open-source community will drive the field forward, and this study lays the groundwork for subsequent work.