# LLM Inference Optimization on Taiwania 2 Supercomputer: Throughput Experiments on V100 Cluster

> LLM inference throughput experiments conducted on V100 GPU nodes of the Taiwania 2 supercomputer, exploring methods to maximize the inference efficiency of large language models in HPC environments.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-01T11:45:38.000Z
- Last activity: 2026-06-01T11:53:08.403Z
- Popularity: 154.9
- Keywords: LLM inference, vLLM, V100, HPC, Taiwania 2, supercomputing, continuous batching, GPU cluster, throughput optimization, model deployment
- Page link: https://www.zingnex.cn/en/forum/thread/2llm-v100
- Canonical: https://www.zingnex.cn/forum/thread/2llm-v100

---

## Introduction: LLM Inference Optimization Experiments on Taiwania 2 V100 Cluster

This article introduces the open-source project LlmInferenceOnTaiwania, which documents LLM inference optimization experiments on the V100 GPU cluster of the Taiwania 2 supercomputer. The project explores how to maximize inference throughput in an HPC environment and shares practical experience for model deployment. Its focus is the vLLM engine: how it is applied and how it is tuned.

## Experiment Background and Hardware Platform Introduction

### Taiwania 2 Hardware Specifications
- 252 GPU nodes, totaling 2016 NVIDIA V100 GPUs
- Single node: 8 V100 GPUs (32GB HBM2 memory) + 2 Intel Xeon Gold CPUs
- Interconnect: NVLink + InfiniBand EDR

### Core Problem
Given the HPC resource constraints (a 1-hour job limit and at most 2 nodes, i.e. 16 V100 GPUs), how can the aggregate output-token throughput of LLM inference be maximized? Higher throughput matters because it lowers the cost per generated token, shortens the time requests spend waiting, and improves resource utilization.
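The metric in question can be written down as a short sketch. The field names and the assumption that all engine instances run concurrently are illustrative choices, not taken from the project.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """One engine instance's results (field names are illustrative)."""
    output_tokens: int   # generated tokens summed over all requests
    wall_seconds: float  # wall-clock duration of this instance's run


def aggregate_output_throughput(runs: list[RunResult]) -> float:
    """Output tokens per second across instances that ran concurrently.

    Assumption: all instances start together, so the job's wall time
    is the duration of the slowest instance.
    """
    if not runs:
        raise ValueError("need at least one run")
    total_tokens = sum(r.output_tokens for r in runs)
    wall = max(r.wall_seconds for r in runs)
    return total_tokens / wall
```

Only output (generated) tokens are counted here; prompt tokens are excluded, which is one common way to read "output token throughput".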

## Inference Engine Selection and Optimization Strategies

### Core Technologies of vLLM Engine
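- **PagedAttention**: Borrows the idea of virtual memory paging, splitting the KV cache into fixed-size blocks so that memory is allocated on demand and fragmentation is reduced
- **Continuous batching**: Admits new requests and retires finished ones at each decoding step, avoiding the idle waiting that static batching incurs when short requests wait for long ones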
- Version selected: vLLM 0.7.0 (compatible with V100's Compute Capability 7.0)
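The paging idea behind PagedAttention can be illustrated with a toy block allocator. This is a simplified sketch, not vLLM's implementation: the class name, the block size of 16 tokens, and the request identifiers are all assumptions made for illustration.

```python
class BlockAllocator:
    """Toy KV-cache allocator: fixed-size blocks handed out on demand."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size               # tokens per block (assumed value)
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # request id -> physical block ids
        self.lengths: dict[str, int] = {}              # request id -> tokens stored

    def append_token(self, request_id: str) -> None:
        """Reserve cache space for one more token of a request."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        # A new block is needed only when every allocated block is full.
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def free(self, request_id: str) -> None:
        """Return all blocks of a finished request to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)
```

Because blocks are claimed one at a time, a request wastes at most one partially filled block, instead of reserving space for its maximum possible length up front.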

### Experimental Optimization Strategies
- Tensor parallelism: Split model parameters across multiple GPUs
- Pipeline parallelism: Split the model by layers to form a pipeline
- Batch size tuning: Balance memory and computing power utilization
- Reduced precision: Explore FP16 weights and INT8 quantization to reduce memory usage
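The two parallelism strategies listed above can be sketched in plain Python. This is a conceptual illustration only: real engines shard tensors across GPUs and exchange results with collective communication, whereas the function names and list-based "shards" here are hypothetical stand-ins.

```python
def matvec(x, w):
    """Row vector x (length k) times matrix w (k rows, n columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, num_shards):
    """Tensor parallelism: split weight columns into equal shards, one per GPU."""
    n = len(w[0])
    if n % num_shards:
        raise ValueError("column count must be divisible by shard count")
    step = n // num_shards
    return [[row[s * step:(s + 1) * step] for row in w] for s in range(num_shards)]


def tensor_parallel_matvec(x, w, num_shards):
    """Each shard computes its slice of the output; slices are then concatenated."""
    out = []
    for shard in split_columns(w, num_shards):
        out.extend(matvec(x, shard))  # on real hardware this runs on separate GPUs
    return out


def split_layers(num_layers, num_stages):
    """Pipeline parallelism: assign contiguous layer ranges to stages."""
    base, extra = divmod(num_layers, num_stages)
    ranges, start = [], 0
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges
```

The sharded product equals the unsharded one, which is why tensor parallelism leaves model output unchanged while dividing the weights held by each GPU.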

Test configuration: 2 nodes (16 V100 GPUs), 1-hour job, covering different input/output lengths.
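One plausible way to express such a configuration is as a `vllm serve` command line. The sketch below only builds the argument list; the model name, the 8x2 split, and the batch and context limits are placeholder assumptions rather than the project's actual settings, and a multi-node run would typically also need a Ray cluster spanning both nodes.

```python
def build_vllm_command(model, tensor_parallel, pipeline_parallel, total_gpus,
                       max_num_seqs=256, max_model_len=4096):
    """Build a `vllm serve` argument list (all values are illustrative)."""
    if tensor_parallel * pipeline_parallel != total_gpus:
        raise ValueError("tensor_parallel * pipeline_parallel must equal total_gpus")
    return [
        "vllm", "serve", model,
        "--dtype", "half",  # FP16; V100 has no BF16 support
        "--tensor-parallel-size", str(tensor_parallel),
        "--pipeline-parallel-size", str(pipeline_parallel),
        "--max-num-seqs", str(max_num_seqs),    # cap on concurrently batched requests
        "--max-model-len", str(max_model_len),  # cap on context length
        "--gpu-memory-utilization", "0.90",
    ]


# Hypothetical example: tensor parallelism inside each 8-GPU node,
# pipeline parallelism across the 2 nodes.
command = build_vllm_command("your-org/your-model", tensor_parallel=8,
                             pipeline_parallel=2, total_gpus=16)
```

Keeping tensor parallelism within a node and pipeline parallelism across nodes is a common choice, since tensor parallelism communicates at every layer and benefits from the faster intra-node link.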

## Experimental Results and Key Findings

### Key Findings
Continuous batching proved to be the single most important optimization. Its advantages:
1. Eliminates idle waiting in static batching
2. Adapts to variable-length sequences in real scenarios
3. Significantly improves GPU utilization
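The effect can be seen in a small simulation that counts decoding steps under each policy. It is a deliberately simplified model, assuming every decoding step costs the same regardless of how many slots are occupied and ignoring prompt processing; the function names are illustrative.

```python
def static_batching_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Continuous batching: a finished request's slot is refilled on the next step."""
    queue = list(reversed(lengths))  # pop() from the end gives FIFO order
    running = []                     # remaining output tokens per active request
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.pop())
        steps += 1  # one decoding step produces one token per active request
        running = [r - 1 for r in running if r > 1]
    return steps


# Output lengths in tokens: two long requests mixed with short ones.
lengths = [8, 1, 1, 1, 8, 1, 1, 1]
static = static_batching_steps(lengths, batch_size=4)          # 16 steps
continuous = continuous_batching_steps(lengths, batch_size=4)  # 9 steps
```

In this toy workload the same 22 tokens take 16 steps with static batching but 9 with continuous batching, because the second long request starts as soon as a slot frees up rather than waiting for the first batch to drain.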

### Other Optimization Effects
- Multi-GPU parallelism: Throughput scales near-linearly up to 16 V100 GPUs
- Memory optimization: Tuning the KV cache allocation makes room for longer contexts
- Scheduling strategy: Tuning how requests are scheduled improves resource allocation
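The link between KV cache size and context length follows from simple arithmetic, sketched below. The model shape used in the example (32 layers, 32 KV heads, head dimension 128, FP16 values) is a hypothetical illustration and not a model named by the project.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache for one token: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(free_bytes, num_layers, num_kv_heads, head_dim,
                      bytes_per_value=2):
    """How many tokens fit in the memory left over after loading weights."""
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value)
    return free_bytes // per_token


# Hypothetical model shape; 16 GiB assumed free for the cache on one GPU.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
capacity = max_cached_tokens(16 * 2**30, num_layers=32, num_kv_heads=32, head_dim=128)
```

With these assumed numbers each token costs 512 KiB of cache, so 16 GiB holds 32,768 tokens in total, shared among all requests in flight. That budget is what trades off batch size against context length.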

The results validate the effectiveness of vLLM's design philosophy.

## Practical Insights and Best Practices

1. **Framework selection**: vLLM is suitable for high-throughput scenarios; for latency-sensitive scenarios, consider TensorRT-LLM
2. **Version matching**: For older GPUs (e.g., V100), prioritize compatible stable versions over the latest ones
3. **HPC scheduling**: Use a scheduler such as SLURM to request and allocate resources sensibly
4. **Monitoring and tuning**: Establish a monitoring system to continuously collect performance data for optimization

The project provides reusable configurations and scripts to lower deployment barriers.
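As a hedged illustration of the HPC scheduling point, the sketch below renders a SLURM batch script for a 2-node, 1-hour job. It is not one of the project's scripts: the job name and the launched command are hypothetical, and site-specific directives such as account or partition are omitted because they depend on the cluster.

```python
def render_sbatch(job_name, nodes, gpus_per_node, minutes, command):
    """Render a minimal SLURM batch script as a string (illustrative only)."""
    hours, mins = divmod(minutes, 60)
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        "#SBATCH --ntasks-per-node=1",          # one launcher process per node
        f"#SBATCH --gres=gpu:{gpus_per_node}",  # GPUs requested on each node
        f"#SBATCH --time={hours:02d}:{mins:02d}:00",
        "",
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job matching the stated limits: 2 nodes x 8 GPUs, 60 minutes.
script = render_sbatch("llm-bench", nodes=2, gpus_per_node=8, minutes=60,
                       command="python run_benchmark.py")
```

Generating the script from parameters makes it easy to sweep node counts or time limits without hand-editing directives.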

## Project Limitations and Hardware Challenges

Inherent limitations of V100 hardware:
1. **Memory capacity**: 32GB per GPU cannot hold a 70B-parameter model, whose FP16 weights alone take about 140GB, so models of that size require model parallelism
2. **Computing capability**: Lacks newer features such as structured sparsity, so efficiency is lower than on A100/H100
3. **Interconnect bandwidth**: NVLink bandwidth is lower than newer generations, which may become a bottleneck in large-scale parallelism

These limitations affect model scale and optimization effects.
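The memory constraint can be made concrete with a back-of-the-envelope calculation. The sketch counts only model weights and assumes 90% of each GPU's memory is usable for them; both the usable fraction and the function name are illustrative assumptions, and a real deployment also needs room for the KV cache and activations.

```python
import math


def min_gpus_for_weights(num_params, bytes_per_param, gpu_mem_bytes,
                         usable_fraction=0.9):
    """Smallest GPU count whose combined memory holds the weights alone."""
    weight_bytes = num_params * bytes_per_param
    return math.ceil(weight_bytes / (gpu_mem_bytes * usable_fraction))


V100_BYTES = 32 * 10**9  # 32GB, taken as decimal gigabytes for simplicity

fp16_gpus = min_gpus_for_weights(70 * 10**9, 2, V100_BYTES)  # 140GB of weights
int8_gpus = min_gpus_for_weights(70 * 10**9, 1, V100_BYTES)  # 70GB of weights
```

Under these assumptions a 70B model needs at least 5 V100s in FP16 and 3 in INT8 just for weights. In practice the count is usually rounded up to a value that divides the model evenly, such as 8 GPUs within one node.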

## Project Summary and Future Outlook

This project shows that older hardware such as the V100 can still deliver satisfactory inference throughput through software optimization, continuous batching above all. It gives research institutions a reusable solution and provides data to inform hardware upgrade decisions.

We look forward to more open-source projects promoting the popularization of AI in the HPC field and facilitating the application of large language models in scientific research.
