Zing Forum

Running a 27B Large Model Locally on Two Used RTX 2080 Ti Cards: A Practical Guide to vLLM 2080 Ti Definitive Edition

Two modified RTX 2080 Ti 22GB graphics cards connected via NVLink, running the vLLM 2080 Ti Definitive Edition runtime, can match or exceed the local large model inference performance of an RTX 3090 Ti at about half the price.

vLLM, RTX 2080 Ti, local large models, NVLink, Qwen, quantized inference, MTP speculative decoding, VRAM optimization, open-source LLM deployment
Published 2026-06-03 15:44 | Recent activity 2026-06-03 15:50 | Estimated read 8 min

Section 01

Introduction: Core Summary of the vLLM 2080 Ti Definitive Edition Practical Guide

Original Author & Source

  • Original Author/Maintainer: weicj
  • Source Platform: GitHub
  • Original Title: vLLM-2080Ti-Definitive: The definitive vLLM runtime for dual RTX 2080 Ti 22GB + NVLink
  • Original Link: https://github.com/weicj/vLLM-2080Ti-Definitive
  • Release Date: June 3, 2026

Core Points

Two modified 22GB RTX 2080 Ti graphics cards connected via NVLink, running the vLLM 2080 Ti Definitive Edition runtime, cost approximately $550 in total, about half the price of a used RTX 3090 Ti, while delivering equivalent or stronger local large model inference performance. The runtime supports models such as Qwen3.6 27B and Gemma4 31B, reaches a single-request decoding speed of over 100 tokens/second, and natively supports a 262K context length.

Section 02

Background: New Life for Old Graphics Cards & Project Goals

NVIDIA announced the RTX 2080 Ti in August 2018; more than seven years later, modified 22GB versions remain active on the used market. Paired with an NVLink bridge, two of these cards have found a second life in local large model inference.

The vLLM 2080 Ti Definitive project has a clear goal: to build a dual-card 2080 Ti platform at about half the cost of a used RTX 3090 Ti, run models with 27B-31B parameters, and achieve a decoding speed of over 100 tok/s and 262K context support.

Section 03

Hardware Foundation: Competitiveness Analysis of Dual 2080 Ti

On paper, dual 2080 Ti 22GB + NVLink has significant hardware advantages over a single RTX 3090 Ti:

| Metric | Dual 2080 Ti 22GB + NVLink | RTX 3090 Ti 24GB | Multiple |
|---|---|---|---|
| CUDA Cores | 8,704 | 5,376 | 1.62x |
| SM Units | 136 | 84 | 1.62x |
| Tensor Cores | 1,088 | 336 | 3.24x |
| FP16 Matrix Throughput | 228 TFLOPS | 160 TFLOPS | 1.43x |
| Total Memory Bandwidth | 1,232 GB/s | 1,008 GB/s | 1.22x |
| Total Memory Capacity | 44GB | 24GB | 1.83x |
| Used Reference Price | ~$550 (including NVLink) | ~$1,100 | 0.5x |

Together the two cards provide 44GB of memory, with NVLink carrying the traffic between them. That is enough to hold 27B-31B quantized models, and the combined compute resources are sufficient to run them.
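
The multiples in the comparison table can be reproduced with simple division. The sketch below copies the figures from the table as-is; the dictionary layout and the names `SPECS` and `ratios` are illustrative choices, not part of the project.

```python
# Figures copied from the comparison table: (dual 2080 Ti, RTX 3090 Ti).
SPECS = {
    "CUDA cores": (8704, 5376),
    "SM units": (136, 84),
    "Tensor cores": (1088, 336),
    "FP16 matrix throughput (TFLOPS)": (228, 160),
    "Total memory bandwidth (GB/s)": (1232, 1008),
    "Total memory capacity (GB)": (44, 24),
    "Used reference price (USD)": (550, 1100),
}


def ratios(specs):
    """Return {metric: dual_value / single_value}, rounded to two decimals."""
    return {name: round(dual / single, 2) for name, (dual, single) in specs.items()}


RATIOS = ratios(SPECS)

if __name__ == "__main__":
    for name, multiple in RATIOS.items():
        print(f"{name}: {multiple:.2f}x")
```

Every row except price is a ratio where higher favors the dual-card setup; for price, the 0.5x multiple means the dual-card setup costs half as much.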

Section 04

Software Stack Optimization: Core Technology Analysis

The project integrates multiple key optimization technologies:

  • Marlin Quantization Format: Optimized for SM75 architecture, balancing precision and memory usage;
  • FlashQLA/FlashInfer/FlashAttention2: Improve throughput in the prefill phase;
  • TurboQuant & INT8 KV Cache: Compress key-value cache to support longer context;
  • Native MTP Speculative Decoding: Generate multiple tokens in one forward pass to accelerate decoding;
  • CUDA Graph Optimization: Reduce CPU overhead and lower latency jitter.
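
To give a feel for why speculative decoding with a few draft tokens speeds up decoding, here is a minimal sketch of a common simplified model. It assumes each drafted token is accepted independently with the same probability `p`, which is an idealization and not a measurement from this project. Under that assumption, a step that drafts `k` tokens yields `(1 - p^(k+1)) / (1 - p)` tokens on average, counting the one token the verifying pass always produces.

```python
def expected_tokens_per_step(k: int, p: float) -> float:
    """Expected tokens emitted per verification step.

    k: number of drafted tokens per step (e.g. 3 for an MTP3-style setup).
    p: assumed per-token acceptance probability (hypothetical, i.i.d.).
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be within [0, 1]")
    if p == 1.0:
        # Every draft is accepted, plus the token from the verifying pass.
        return float(k + 1)
    # Geometric series: 1 + p + p^2 + ... + p^k
    return (1.0 - p ** (k + 1)) / (1.0 - p)


if __name__ == "__main__":
    for depth in (1, 3, 5):
        for accept in (0.5, 0.7, 0.9):
            gain = expected_tokens_per_step(depth, accept)
            print(f"k={depth} p={accept}: {gain:.2f} tokens/step")
```

The series flattens quickly as `k` grows when `p` is well below 1, so deeper drafting adds verification work for little extra output. Real acceptance rates vary by position and by prompt, so treat this as intuition only.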

Section 05

Practical Configuration: Recommended Scheme for Qwen3.6 27B

For Qwen3.6 27B, the project provides three KV cache precision schemes and four recommended configurations:

KV Cache Precision Comparison

| Feature | FP16 KV | INT8 KV | TQ4NC KV |
|---|---|---|---|
| Marlin Weight Quantization | ✅ AWQ/GPTQ | ✅ AWQ/GPTQ | ✅ AWQ/GPTQ |
| Native MTP3 Decoding | ✅ High speed for short context | ✅ Balance between capacity and speed | ✅ Compressed capacity |
| Native 262K Context | ✅ No MTP support | ⚠️ Candidate scheme | ✅ Recommended for services |
| Multimodal Image Service | ✅ Default route | 🔴 Output corrupted | ✅ Recommended for images |
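
The capacity side of this comparison comes down to bytes per cached element. The sketch below uses the generic estimate for standard attention: two tensors (K and V) per layer, each of size `kv_heads * head_dim` per token. The layer count, head count, and head dimension are placeholder values and not Qwen3.6 27B's real architecture. Treating TQ4NC as roughly 4 bits per element is also an assumption, and quantization scales and other metadata are ignored.

```python
# Approximate storage cost per cached element, in bytes.
# "4bit" stands in for a 4-bit scheme such as TQ4NC (assumption).
BYTES_PER_ELEMENT = {"fp16": 2.0, "int8": 1.0, "4bit": 0.5}


def kv_cache_gib(tokens: int, layers: int, kv_heads: int, head_dim: int,
                 precision: str) -> float:
    """Estimate KV cache size in GiB for one sequence of `tokens` tokens."""
    # Factor 2 covers the separate key and value tensors.
    elements = 2 * layers * kv_heads * head_dim * tokens
    return elements * BYTES_PER_ELEMENT[precision] / 2**30


if __name__ == "__main__":
    # Placeholder dimensions, chosen only to make the scaling visible.
    dims = dict(layers=16, kv_heads=4, head_dim=256)
    for precision in ("fp16", "int8", "4bit"):
        size = kv_cache_gib(tokens=262_144, precision=precision, **dims)
        print(f"{precision}: {size:.1f} GiB")
```

Whatever the real dimensions are, the estimate halves at each step from FP16 to INT8 to 4-bit. The memory freed this way is what leaves room for long context and MTP at the same time.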

Recommended Configurations

  1. High-quality native context: FP16 KV + 262K context (no MTP);
  2. Short context, high speed: FP16 KV + 8K-16K context + MTP3;
  3. High compression capacity: TQ4NC KV + 262K context + MTP3;
  4. Multimodal service: TQ4NC KV + 262K context + MTP3.
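
One way to keep these four recommendations handy is a small lookup table in a launcher script. This is only a sketch: the workload keys, the `Profile` class, and `pick_profile` are hypothetical names, and the fields do not correspond to actual command-line flags of the runtime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    kv_cache: str   # KV cache precision scheme
    context: str    # context length as listed in the recommendations
    mtp_depth: int  # number of MTP draft tokens; 0 means MTP is disabled


# The four recommended configurations, keyed by hypothetical workload names.
PROFILES = {
    "high_quality_long_context": Profile("FP16", "262K", 0),
    "short_context_high_speed": Profile("FP16", "8K-16K", 3),
    "high_compression_capacity": Profile("TQ4NC", "262K", 3),
    "multimodal_service": Profile("TQ4NC", "262K", 3),
}


def pick_profile(workload: str) -> Profile:
    """Return the recommended profile for a workload, or raise KeyError."""
    try:
        return PROFILES[workload]
    except KeyError:
        known = ", ".join(sorted(PROFILES))
        raise KeyError(f"unknown workload {workload!r}; expected one of: {known}") from None


if __name__ == "__main__":
    print(pick_profile("multimodal_service"))
```

INT8 KV is left out on purpose: the comparison lists it only as a candidate for 262K context and as corrupting output in image scenarios.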

Section 06

Performance Test: Actual Performance of Qwen3.6 27B

Qwen3.6 27B performance test results:

  • Prefill: reaches 1747 tok/s on a 4096-token prompt; first-response latency for long documents is under 3 seconds;
  • Decoding: when outputting 128 tokens, MTP3 mode exceeds 100 tok/s, which is close to a smooth streaming experience;
  • MTP3 is the recommended setting: it balances acceptance rate against actual throughput. MTP5 is faster in theory but less practical in real use.
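
The reported rates translate into wall-clock time by simple division. The sketch below uses only the figures quoted above; the function names are illustrative, and the decode rate is taken as exactly 100 tok/s, the lower bound of "over 100 tok/s".

```python
def prefill_seconds(prompt_tokens: int, prefill_tok_per_s: float) -> float:
    """Time to process the prompt before the first output token."""
    return prompt_tokens / prefill_tok_per_s


def decode_seconds(output_tokens: int, decode_tok_per_s: float) -> float:
    """Time to generate the output at a steady decoding rate."""
    return output_tokens / decode_tok_per_s


# Figures from the test results: 4096-token prompt at 1747 tok/s,
# 128 output tokens at 100 tok/s (lower bound of the reported rate).
PREFILL_S = prefill_seconds(4096, 1747.0)
DECODE_S = decode_seconds(128, 100.0)

if __name__ == "__main__":
    print(f"prefill: {PREFILL_S:.2f} s")
    print(f"decode:  {DECODE_S:.2f} s")
```

A 4096-token prompt takes about 2.3 seconds to prefill, consistent with the stated first-response latency of under 3 seconds. The 128-token output takes at most about 1.3 seconds.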

Section 07

Limitations & Notes

The project has the following limitations:

  1. Not a multi-tenant architecture: the runtime is optimized for a single concurrent request; multiple agents require queue isolation;
  2. INT8 KV image service issue: text works normally, but output is corrupted in image scenarios;
  3. FP16 262K context limitation: genuinely long prompts are supported only in non-MTP mode; MTP3 mode is prone to OOM (out of memory).
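
Queue isolation for multiple agents can be as simple as serializing requests in front of the server. The following is a minimal sketch using Python's `asyncio.Lock`. `SingleFlightGate` and `fake_generate` are hypothetical names, and `fake_generate` merely stands in for whatever client call actually reaches the inference server.

```python
import asyncio


class SingleFlightGate:
    """Allow only one request at a time to reach the backend."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()  # waiters are served in FIFO order
        self._in_flight = 0
        self.max_in_flight = 0       # recorded so the behavior can be checked

    async def run(self, handler, *args):
        async with self._lock:
            self._in_flight += 1
            self.max_in_flight = max(self.max_in_flight, self._in_flight)
            try:
                return await handler(*args)
            finally:
                self._in_flight -= 1


async def fake_generate(agent: str, prompt: str) -> str:
    """Placeholder for a real request to the inference server."""
    await asyncio.sleep(0.01)
    return f"{agent}: reply to {prompt!r}"


async def demo():
    gate = SingleFlightGate()
    calls = (gate.run(fake_generate, f"agent-{i}", "hi") for i in range(3))
    results = await asyncio.gather(*calls)
    return results, gate.max_in_flight


if __name__ == "__main__":
    replies, peak = asyncio.run(demo())
    print(replies, "peak concurrency:", peak)
```

Three agents submit at once, but the gate lets them through one by one, so the backend never sees more than one request in flight. A production setup would likely add timeouts and per-agent fairness on top of this.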

Section 08

Summary & Recommendations: Value Mining of Old Hardware

Summary

This project demonstrates the value of reusing old hardware: with software optimization, the seven-year-old 2080 Ti can run mainstream medium-scale models, and a dual-card setup can outperform a single newer-generation card that costs twice as much.

Recommendations

Developers on a limited budget can adopt this scheme instead of investing in the latest hardware, using open-source optimization to tap the potential of older cards. The barrier to large model inference depends less on new hardware than on how fully the software stack utilizes the hardware available.