Zing Forum


Evaluation of LLM Inference Frameworks at Boğaziçi University: In-depth Analysis of vLLM and PagedAttention

A graduation project from Turkey's top university Boğaziçi University systematically evaluates mainstream LLM inference frameworks and deeply analyzes the PagedAttention mechanism of vLLM and its performance characteristics.

Tags: vLLM · PagedAttention · LLM inference benchmarking · Boğaziçi University · KV Cache optimization · LLM deployment · inference framework comparison
Published 2026-04-22 03:45 · Recent activity 2026-04-22 03:51 · Estimated read: 6 min

Section 01

Introduction: Core Interpretation of Boğaziçi University's LLM Inference Framework Evaluation Project

The graduation project PERFORMANCE-EVALUATIONS-OF-LLM-INFERENCE-FRAMEWORKS from Turkey's Boğaziçi University (Bosphorus University) has been open-sourced. It systematically evaluates mainstream LLM inference frameworks, focusing on vLLM and its core PagedAttention mechanism, and provides data to support framework selection in production environments. This article interprets the project's research results and practical value.


Section 02

Project Background: Boğaziçi University and Research Objectives

Founded in 1863, Boğaziçi University is one of the oldest and most academically reputable universities in Turkey, with its Engineering Faculty enjoying high prestige in the Middle East. This graduation project was completed by a team of senior students from the Department of Computer Engineering, with objectives including: establishing a systematic evaluation methodology for LLM inference frameworks; quantitatively analyzing the benefits of vLLM's PagedAttention mechanism; comparing performance characteristics of frameworks such as vLLM, TensorRT-LLM, and DeepSpeed-Inference; and providing data support for framework selection in production environments.


Section 03

Evaluation Methodology: Experimental Design and Metric System

Test models: Llama-2-7B/13B/70B, Mistral-7B-Instruct, OPT-13B.
Datasets: short text generation (<500 tokens), long text generation (1k-4k tokens), and a mixed load.
Evaluation metrics: throughput (token throughput, request throughput, TTFT (time to first token), TPOT (time per output token)), resource efficiency (GPU memory utilization, KV Cache efficiency, energy consumption), and service quality (P99 latency, throughput-latency trade-off).
Hardware environment: NVIDIA A100 80GB SXM4, AMD EPYC 7742 (64 cores), 512GB DDR4, InfiniBand HDR.
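The latency metrics above are straightforward to compute from per-request timestamps. The following is a minimal sketch of how such a harness might derive TTFT, TPOT, and aggregate token throughput; the trace structure and function names are illustrative, not taken from the project's code.

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    submit_time: float        # when the request was enqueued
    token_times: list[float]  # wall-clock timestamp of each generated token

def ttft(trace: RequestTrace) -> float:
    """Time To First Token: delay from submission to the first token."""
    return trace.token_times[0] - trace.submit_time

def tpot(trace: RequestTrace) -> float:
    """Time Per Output Token: mean gap between successive tokens."""
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return sum(gaps) / len(gaps)

def token_throughput(traces: list[RequestTrace]) -> float:
    """Aggregate tokens/second over the whole measurement window."""
    start = min(t.submit_time for t in traces)
    end = max(t.token_times[-1] for t in traces)
    total = sum(len(t.token_times) for t in traces)
    return total / (end - start)
```

A single request submitted at t=0 whose tokens arrive at 0.5s, 0.6s, 0.7s, 0.8s has a TTFT of 0.5s, a TPOT of 0.1s, and contributes 4 tokens over 0.8s, i.e. 5 tokens/s.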


Section 04

Key Findings: Performance Advantages of PagedAttention and Framework Comparison

vLLM's PagedAttention borrows the idea of virtual memory paging: the KV Cache is split into fixed-size blocks addressed through a per-sequence block table, which eliminates the memory waste, fragmentation, and lack of dynamic growth of traditional contiguously allocated KV Caches. Evaluation results: in high-concurrency scenarios, vLLM's throughput is 3-5x that of Hugging Face Transformers, GPU memory utilization exceeds 85%, and P99 latency drops by 60%; with variable-length sequences, the fragmentation rate falls below 5%; in beam search, memory usage drops by 40-60%, since sibling beams can share KV blocks. Framework comparison: TensorRT-LLM delivers excellent single-GPU performance but long compilation times; DeepSpeed-Inference scales well across GPUs but has lower single-GPU throughput; llama.cpp suits CPU inference but underutilizes GPUs.
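The block-table idea behind PagedAttention can be illustrated with a toy allocator: memory is committed one fixed-size block at a time as a sequence grows, rather than pre-reserving the maximum length, so waste is bounded by at most one partially filled block per sequence. This is a conceptual sketch, not vLLM's actual implementation; all names here are invented for illustration.

```python
class PagedKVCache:
    """Toy block-table allocator in the spirit of PagedAttention.

    The KV cache is carved into fixed-size blocks; each sequence holds a
    list of (possibly non-contiguous) block ids, so memory is allocated
    lazily, block by block, as tokens are generated.
    """

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> block ids
        self.seq_lens: dict[int, int] = {}

    def append_token(self, seq_id: int) -> None:
        """Grow a sequence by one token, taking a new block only when
        the last block is full (or the sequence has no blocks yet)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; preempt or swap")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: int) -> None:
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def waste_fraction(self) -> float:
        """Internal fragmentation: reserved-but-unused KV slots."""
        reserved = sum(len(t) for t in self.block_tables.values()) * self.block_size
        used = sum(self.seq_lens.values())
        return 0.0 if reserved == 0 else 1 - used / reserved
```

With block_size=2, a 3-token sequence occupies two blocks and wastes exactly one slot (25%); a contiguous allocator reserving, say, a 4k-token maximum for the same sequence would waste over 99%, which is the gap the evaluation's fragmentation numbers reflect.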


Section 05

Practical Insights and Future Directions

Selection recommendations: choose vLLM for maximum throughput; TensorRT-LLM for the lowest single-GPU latency; DeepSpeed-Inference combined with vLLM for ultra-large-scale models; llama.cpp for edge/CPU deployment.
Tuning recommendations: the default block_size is 16; try 8 for short sequences; enable CPU offload (--swap-space 4) to support longer contexts; use priority scheduling to improve multi-tenant fairness.
Limitations: limited model coverage (no MoE architectures), a single hardware type (A100 only), and synthetic test data.
Future directions: multi-modal inference optimization, combining speculative decoding with PagedAttention, and heterogeneous computing analysis.
Summary: this project provides valuable practice in evaluating LLM inference frameworks, verifies the outsized impact of memory management optimization, and is a useful reference for deployment engineers. Project address: https://github.com/erayyuklu/PERFORMANCE-EVALUATIONS-OF-LLM-INFERENCE-FRAMEWORKS.
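The block-size and swap tuning described above maps onto vLLM's Python API roughly as follows. This is a configuration sketch, not measured optima: the parameter names (block_size, swap_space, gpu_memory_utilization) are standard vLLM engine arguments, but defaults and valid values vary by version, so check the documentation of your installed release.

```python
from vllm import LLM, SamplingParams

# Configuration sketch for a short-sequence, high-concurrency workload.
# Values are illustrative starting points taken from the article's
# tuning recommendations, not benchmarked optima.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    block_size=8,                 # smaller KV blocks waste less on short sequences
    swap_space=4,                 # GiB of CPU RAM for swapped-out KV blocks
    gpu_memory_utilization=0.90,  # fraction of GPU memory vLLM may claim
)

outputs = llm.generate(
    ["Summarize PagedAttention in one sentence."],
    SamplingParams(max_tokens=128),
)
```

Running this requires a GPU and the model weights; treat it as a template for wiring the tuning knobs, not a benchmark script.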