Zing Forum

LLM Inference Optimization on Taiwania 2 Supercomputer: Throughput Experiments on V100 Cluster

This article covers LLM inference throughput experiments run on the V100 GPU nodes of the Taiwania 2 supercomputer and explores ways to maximize the inference efficiency of large language models in HPC environments.

Tags: LLM inference, vLLM, V100, HPC, Taiwania 2, supercomputer, continuous batching, GPU cluster, throughput optimization, model deployment
Published 2026-06-01 19:45 | Recent activity 2026-06-01 19:53 | Estimated read 6 min

Section 01

Introduction: LLM Inference Optimization Experiments on Taiwania 2 V100 Cluster

This article introduces the open-source project LlmInferenceOnTaiwania, which documents LLM inference optimization experiments on the V100 GPU cluster of the Taiwania 2 supercomputer. The project explores how to maximize inference throughput in HPC environments and offers practical experience for model deployment. Its main focus is how the vLLM engine is applied and tuned.

Section 02

Experiment Background and Hardware Platform Introduction

Taiwania 2 Hardware Specifications

  • 252 GPU nodes, totaling 2016 NVIDIA V100 GPUs
  • Single node: 8 V100 GPUs (32GB HBM2 memory) + 2 Intel Xeon Gold CPUs
  • Interconnect: NVLink + InfiniBand EDR

Core Problem

Given the HPC resource constraints (a 1-hour job limit and at most 2 nodes, i.e. 16 V100 GPUs), how can the aggregate output-token throughput of LLM inference be maximized? Answering this matters for three reasons: lower cost, lower latency, and better resource utilization.
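To make the metric concrete: aggregate output-token throughput is the total number of tokens generated across all requests divided by the wall-clock time. The helper below is a minimal sketch of that definition. The function name and the sample numbers are illustrative and are not taken from the project.

```python
def aggregate_output_throughput(output_token_counts, wall_seconds):
    """Total generated tokens across all requests, per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return sum(output_token_counts) / wall_seconds


# Illustrative numbers only: three requests that finish within a 2-second window.
print(aggregate_output_throughput([128, 256, 64], 2.0))  # 224.0
```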

Section 03

Inference Engine Selection and Optimization Strategies

Core Technologies of vLLM Engine

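  • PagedAttention: Borrows the idea of virtual-memory paging. The KV cache is split into fixed-size blocks that are allocated on demand, which reduces fragmentation and improves memory utilization
  • Continuous batching: Adds new requests to the running batch and removes finished ones at each generation step, avoiding the idle waiting of static batching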
  • Version selected: vLLM 0.7.0 (compatible with V100's Compute Capability 7.0)
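The paging idea can be illustrated with a toy block allocator. This is a simplified sketch of the bookkeeping only, not vLLM's implementation: the class name, the block size of 16, and the pool size are assumptions chosen for the example.

```python
class PagedKVCache:
    """Toy block allocator: each sequence owns a list of fixed-size blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> number of tokens stored so far

    def append_token(self, seq_id):
        """Reserve room for one more token; allocate a block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:  # last block full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; the request must wait")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(20):  # 20 tokens occupy 2 blocks, not a max-length reservation
    cache.append_token("a")
print(len(cache.block_tables["a"]), len(cache.free_blocks))  # 2 6
cache.free("a")
print(len(cache.free_blocks))  # 8
```

Because blocks are handed out one at a time, a short sequence never holds memory sized for the longest possible sequence, and its blocks become reusable as soon as it finishes.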

Experimental Optimization Strategies

  • Tensor parallelism: Split model parameters across multiple GPUs
  • Pipeline parallelism: Split the model by layers to form a pipeline
  • Batch size tuning: Balance memory and computing power utilization
  • Reduced precision and quantization: Explore FP16 and INT8 to reduce memory usage

Test configuration: 2 nodes (16 V100 GPUs), 1-hour job, covering different input/output lengths.
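As a rough illustration of how these strategies map onto vLLM settings, the sketch below assembles a `vllm serve` command line for a 2-node, 16-GPU layout. The article does not give the project's actual configuration, so the parallelism split (tensor parallel 8 within a node, pipeline parallel 2 across nodes), the numeric values, and the model id are assumptions. Multi-node runs also need a distributed backend such as Ray set up across the nodes, which is not shown here.

```python
import shlex


def build_vllm_serve_command(model, tensor_parallel=8, pipeline_parallel=2,
                             max_num_seqs=256, gpu_memory_utilization=0.90):
    """Assemble a `vllm serve` command line for a 2-node x 8-GPU layout."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel),      # shard weights within a node
        "--pipeline-parallel-size", str(pipeline_parallel),  # split layers across nodes
        "--dtype", "half",                                   # FP16; V100 has no BF16 support
        "--max-num-seqs", str(max_num_seqs),                 # cap on concurrently scheduled sequences
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


cmd = build_vllm_serve_command("your-org/your-model")  # placeholder model id
print(shlex.join(cmd))
```

Batch size tuning in this setup mostly means varying `--max-num-seqs` and `--gpu-memory-utilization` and measuring throughput for each combination.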

Section 04

Experimental Results and Key Findings

Key Findings

Continuous batching turned out to be the most important optimization. Its advantages:

  1. Eliminates idle waiting in static batching
  2. Adapts to variable-length sequences in real scenarios
  3. Significantly improves GPU utilization
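The effect of the first two points can be seen in a small simulation. This is a simplified model under stated assumptions: it counts decode steps only, ignores prefill and memory limits, and uses made-up output lengths. It is not a reproduction of the project's measurements.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Each batch runs until its longest request finishes; short ones sit idle."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Finished requests leave after every step and queued ones take their slots."""
    queue = deque(output_lens)
    running = []  # remaining decode steps for each active request
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        running = [r - 1 for r in running if r > 1]
    return steps


lens = [10, 200, 15, 180, 12, 190, 8, 20]  # made-up, deliberately uneven lengths
print(static_batching_steps(lens, 2), continuous_batching_steps(lens, 2))  # 590 395
```

With the same two batch slots, the continuous scheduler finishes the same work in fewer steps because a slot freed by a short request is refilled immediately instead of waiting for the longest request in its batch.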

Other Optimization Effects

  • Multi-GPU parallelism: Throughput scales near-linearly across 16 V100 GPUs
  • Memory optimization: Adjusting the KV cache allocation allows longer contexts to be supported
  • Scheduling strategy: Optimize resource allocation
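The link between KV cache size and context length is simple arithmetic: every cached token stores one key and one value vector per layer. The sketch below works through it for a hypothetical model shape (32 layers, 32 KV heads, head dimension 128, FP16 cache) and an assumed 16 GiB left for the cache. Neither figure comes from the project.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Keys and values (factor 2) for every layer, at the given element width."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(free_bytes, bytes_per_token):
    """How many tokens of KV cache fit in the memory left after loading weights."""
    return free_bytes // bytes_per_token


# Hypothetical model shape: 32 layers, 32 KV heads, head dimension 128, FP16 cache.
per_token = kv_cache_bytes_per_token(32, 32, 128)
print(per_token)                                    # 524288 bytes (0.5 MiB)
print(max_cached_tokens(16 * 1024**3, per_token))   # 32768 tokens in 16 GiB
```

That token budget is shared by all sequences in flight, so it bounds both the longest context and how many requests can be batched at once.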

The results validate the effectiveness of vLLM's design philosophy.

Section 05

Practical Insights and Best Practices

  1. Framework selection: vLLM is suitable for high-throughput scenarios; for latency-sensitive scenarios, consider TensorRT-LLM
  2. Version matching: For older GPUs (e.g., V100), prioritize compatible stable versions over the latest ones
  3. HPC scheduling: Use scheduling systems like SLURM to allocate resources reasonably
  4. Monitoring and tuning: Establish a monitoring system to continuously collect performance data for optimization
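For the scheduling point, a job that fits the constraints described earlier (2 nodes, 8 GPUs per node, 1 hour) could be described with a batch script like the one rendered below. This is a minimal sketch, not the project's script: the job name and `run_benchmark.py` are hypothetical, and site-specific settings such as account and partition are left out because the article does not state them.

```python
def render_sbatch_script(job_name, command, nodes=2, gpus_per_node=8,
                         time_limit="01:00:00"):
    """Render a minimal Slurm batch script as a string."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        "#SBATCH --ntasks-per-node=1",
        f"#SBATCH --gres=gpu:{gpus_per_node}",  # GPUs requested per node
        f"#SBATCH --time={time_limit}",         # wall-clock limit (HH:MM:SS)
        "",
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job name and benchmark entry point.
print(render_sbatch_script("llm-throughput", "python run_benchmark.py"))
```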

The project provides reusable configurations and scripts to lower deployment barriers.

Section 06

Project Limitations and Hardware Challenges

Inherent limitations of V100 hardware:

  1. Memory capacity: 32GB is tight for models with 70B+ parameters, requiring model parallelism
  2. Computing capability: Lacks newer features such as hardware-accelerated sparse computation, so efficiency is lower than on A100/H100
  3. Interconnect bandwidth: NVLink bandwidth is lower than newer generations, which may become a bottleneck in large-scale parallelism
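The memory constraint in point 1 follows from a back-of-the-envelope calculation, sketched below. It counts weights only, assuming FP16 (2 bytes per parameter). KV cache, activations, and framework overhead come on top, so the real requirement is higher.

```python
import math


def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory for model weights alone, in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(num_params, gpu_memory_gb=32, bytes_per_param=2):
    """Lower bound on GPU count: weights only, no KV cache or activations."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / gpu_memory_gb)


print(weight_memory_gb(70e9))        # 140.0 GB in FP16
print(min_gpus_for_weights(70e9))    # 5
print(weight_memory_gb(70e9) / 16)   # 8.75 GB of weights per GPU across 16 GPUs
```

A 70B model therefore cannot fit on a single 32GB V100 even before any cache is allocated, which is why it has to be split across GPUs.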

These limitations affect model scale and optimization effects.

Section 07

Project Summary and Future Outlook

This project shows that satisfactory inference throughput can be achieved on older hardware (V100) through software optimization, especially continuous batching. It gives research institutions a reusable solution and provides data to inform hardware upgrade decisions.

We look forward to more open-source projects that promote the adoption of AI in the HPC field and make large language models easier to apply in scientific research.