Performance Evaluation Study of LLM Inference Frameworks at Boğaziçi University: In-depth Analysis of vLLM and PagedAttention

A graduation project from the Department of Computer Engineering at Boğaziçi University in Turkey that systematically benchmarks and analyzes large language model (LLM) inference frameworks, focusing on the performance of vLLM and its PagedAttention mechanism.

Tags: LLM inference · vLLM · PagedAttention · performance optimization · large language models · benchmark testing
Published 2026-05-04 06:41 · Last activity 2026-05-04 06:50 · Estimated read: 6 min

Section 01

[Introduction] Core Overview of the Performance Evaluation Study on LLM Inference Frameworks at Boğaziçi University

This graduation project from the Department of Computer Engineering at Boğaziçi University in Turkey systematically benchmarks and analyzes LLM inference frameworks, centering on vLLM and its underlying PagedAttention mechanism. The study provides a useful reference for assessing the commercial feasibility and industrial deployment of LLM inference services.

Section 02

Research Background and Motivation

Large language model (LLM) inference services are a core component of AI infrastructure; inference efficiency and cost control directly determine whether the technology can be commercialized. LLM inference, however, poses challenges that make traditional serving optimizations hard to apply: very large parameter counts, autoregressive token-by-token generation, and highly variable input/output lengths. This Boğaziçi University graduation project focuses on the performance evaluation of LLM inference frameworks, reflecting the academic community's growing attention to AI engineering practice.

Section 03

Technical Analysis of vLLM and PagedAttention

vLLM is an open-source LLM inference engine developed at the Sky Computing Lab of the University of California, Berkeley; its core innovation is the PagedAttention mechanism. Traditional LLM inference pre-allocates contiguous memory for each request's KV cache, which leads to waste and fragmentation and limits the number of concurrent requests. PagedAttention borrows the idea of virtual-memory paging: it divides the KV cache into fixed-size blocks, allocates blocks on demand, and supports sharing blocks across sequences, thereby improving memory efficiency and concurrency.
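To make the paging idea concrete, the sketch below models a block table that maps a sequence's logical token positions to physical KV-cache blocks, with reference counting so blocks can be shared. This is a minimal illustration under assumed names (BlockAllocator, Sequence, BLOCK_SIZE), not vLLM's actual data structures.

```python
BLOCK_SIZE = 16  # tokens per physical KV-cache block (a typical vLLM default)

class BlockAllocator:
    """Hands out physical block IDs; refcounts allow blocks to be shared."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))   # free physical block IDs
        self.refcount = [0] * num_blocks      # >1 means the block is shared

    def allocate(self) -> int:
        block = self.free.pop()               # a real engine would preempt/evict if empty
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        self.refcount[block] += 1             # e.g. several requests sharing one prompt prefix
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)           # returned to the pool; no fragmentation

class Sequence:
    """Maps logical token positions to physical blocks via a block table."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []      # entry i covers tokens [i*BLOCK_SIZE, (i+1)*BLOCK_SIZE)
        self.num_tokens = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0: # current block full: allocate a new one on demand
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1
```

The block table plays the same role as a page table in an operating system: internal fragmentation is bounded to at most one partially filled block per sequence, and a shared prompt prefix can be mapped by several sequences until a diverging write forces a copy.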

Section 04

Benchmark Testing Methodology

The study adopts a careful experimental design with a multi-dimensional evaluation system. Model selection covers scales from billions to hundreds of billions of parameters; load design accounts for input/output length distributions, request arrival patterns, and similar factors to simulate real-world scenarios; evaluation metrics include system-level indicators such as throughput, latency, memory utilization, GPU utilization, and energy consumption, ensuring comprehensive and reproducible results.
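As a rough illustration of how such metrics can be collected, here is a minimal measurement harness. The generate callable is a hypothetical stand-in for a call into the serving engine, and a real load generator would issue requests concurrently according to the arrival pattern under test rather than sequentially as done here.

```python
import statistics
import time

def run_benchmark(generate, prompts):
    """Measure per-request latency and aggregate token throughput.

    `generate(prompt)` is a hypothetical stand-in for the serving engine;
    it is assumed to block until completion and return the number of
    generated tokens.
    """
    latencies, total_tokens = [], 0
    start = time.perf_counter()
    for prompt in prompts:
        t0 = time.perf_counter()
        total_tokens += generate(prompt)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "throughput_tok_per_s": total_tokens / elapsed,
        "mean_latency_s": statistics.fmean(latencies),
        "p50_latency_s": statistics.median(latencies),
        "p99_latency_s": statistics.quantiles(latencies, n=100)[98],
    }
```

Numbers from such a harness are only meaningful alongside the load description (length distributions, request rate), which is why the study fixes these as part of the experimental design.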

Section 05

Exploration of Performance Optimization Strategies

Based on the benchmark results, the study explores optimizations at multiple levels. Among batching strategies, it compares continuous batching (dynamically admitting requests into the running batch to keep GPU utilization high) with dynamic batching (grouping queued requests to improve parallelism). For memory optimization, beyond PagedAttention, quantization techniques (such as converting weights from FP16 to INT8) are studied to reduce memory usage and raise throughput, though these gains must be weighed against precision loss.
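The scheduling difference is easiest to see in schematic code. Below is a minimal continuous-batching loop (illustrative only; the engine object and its step/is_finished methods are assumptions, not a real vLLM API): finished sequences leave the batch immediately and waiting requests join at every iteration, instead of the whole batch draining before new work is admitted.

```python
from collections import deque

def continuous_batching_loop(engine, waiting: deque, max_batch: int):
    """Schematic scheduler: refill the running batch on every decode step."""
    running = []
    while running or waiting:
        # Admit new requests up to the batch limit (a real scheduler would
        # also check that enough free KV-cache blocks are available).
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        engine.step(running)  # one forward pass: one new token per running sequence
        running = [seq for seq in running if not engine.is_finished(seq)]
```

On the quantization side, the arithmetic is simple: moving weights from FP16 (2 bytes per parameter) to INT8 (1 byte per parameter) roughly halves weight memory, e.g. from about 14 GB to about 7 GB for a 7B-parameter model, freeing space for KV cache at the risk of some accuracy degradation.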

Section 06

Research Findings and Industry Implications

vLLM's PagedAttention mechanism significantly improves memory efficiency, which translates directly into lower inference costs. Performance optimization involves multi-objective trade-offs (e.g., throughput vs. latency, memory usage vs. computational complexity), so there is no universally optimal configuration. Finally, the open-source community drives technological progress: the open release of vLLM, and of this study, accelerates development across the industry.

Section 07

Educational Value and Academic Contributions

As a graduation project, the work cultivates students' ability to apply knowledge across disciplines (operating systems, parallel computing, machine learning, etc.). Academically, it offers a worked example of empirical research on LLM inference, complementing industry reports with a more comprehensive perspective.

Section 08

Future Outlook

LLM inference technology continues to evolve, with new techniques such as speculative decoding, MoE optimization, and hardware-customized kernels emerging, so performance evaluations need continual updating. This project gives readers a starting point for understanding LLM inference systems in depth; exploring the latest research from here is recommended.