# Running a 27B Large Model Locally on Two Used RTX 2080 Ti Cards: A Practical Guide to vLLM 2080 Ti Definitive Edition

> Two modified RTX 2080 Ti 22GB graphics cards connected via NVLink and running the vLLM 2080 Ti Definitive Edition runtime can match or exceed the local large model inference performance of an RTX 3090 Ti at about half its used price.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-03T07:44:36.000Z
- Last activity: 2026-06-03T07:50:52.798Z
- Popularity: 161.9
- Keywords: vLLM, RTX 2080 Ti, local large models, NVLink, Qwen, quantized inference, MTP speculative decoding, VRAM optimization, open-source LLM deployment
- Page link: https://www.zingnex.cn/en/forum/thread/rtx-2080-ti27b-vllm-2080-ti
- Canonical: https://www.zingnex.cn/forum/thread/rtx-2080-ti27b-vllm-2080-ti

---

## Introduction: Core Summary of the vLLM 2080 Ti Definitive Edition Practical Guide

**Original Author & Source**
- Original Author/Maintainer: weicj
- Source Platform: GitHub
- Original Title: vLLM-2080Ti-Definitive: The definitive vLLM runtime for dual RTX 2080 Ti 22GB + NVLink
- Original Link: https://github.com/weicj/vLLM-2080Ti-Definitive
- Release Date: June 3, 2026

**Core Points**
Two modified 22GB RTX 2080 Ti graphics cards connected via NVLink and running the vLLM 2080 Ti Definitive Edition runtime can match or exceed the local large model inference performance of an RTX 3090 Ti at about half its used price (roughly $550 for the pair, NVLink bridge included, versus roughly $1,100). The setup supports models such as Qwen3.6 27B and Gemma4 31B, decodes a single request at over 100 tokens/second, and natively supports a 262K context length.

## Background: New Life for Old Graphics Cards & Project Goals

NVIDIA announced the RTX 2080 Ti in August 2018; more than seven years later, versions modified to carry 22GB of memory are widely traded on the used market. Bridged with NVLink, a pair of these cards has found a second life in local large model inference.

The vLLM 2080 Ti Definitive project has a clear goal: build a dual 2080 Ti platform for about half the cost of a used RTX 3090 Ti, run models with 27B-31B parameters on it, decode at over 100 tok/s, and support a 262K context.

## Hardware Foundation: Competitiveness Analysis of Dual 2080 Ti

On paper, two RTX 2080 Ti 22GB cards linked by NVLink compare favorably with a single RTX 3090 Ti:

| Metric | Dual 2080 Ti 22GB + NVLink | RTX 3090 Ti 24GB | Ratio (dual 2080 Ti / 3090 Ti) |
|------|------------------------|------------------|------|
| CUDA Cores | 8,704 | 5,376 | 1.62x |
| SM Units | 136 | 84 | 1.62x |
| Tensor Cores | 1,088 | 336 | 3.24x |
| FP16 Matrix Throughput | 228 TFLOPS | 160 TFLOPS | 1.43x |
| Total Memory Bandwidth | 1,232 GB/s | 1,008 GB/s | 1.22x |
| Total Memory Capacity | 44GB | 24GB | 1.83x |
| Used Reference Price | ~$550 (including NVLink) | ~$1,100 | 0.5x |

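Together the two cards provide 44GB of memory (22GB each, with the model split across both cards and NVLink carrying the inter-GPU traffic). That is enough to hold quantized 27B-31B models, and the combined compute resources are ample for inference.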
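
To make the capacity claim concrete, the following back-of-the-envelope sketch estimates the weight footprint of a quantized model. It is an illustration, not the project's own sizing method: the 4-bit weight width is an assumption (typical for AWQ/GPTQ checkpoints), the even split across two cards is an assumption, quantization scales and zero points are ignored, and "22GB" is treated as 22 GiB for simplicity.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GiB.

    Ignores quantization scales/zero points and any layers kept at
    higher precision, so real checkpoints are somewhat larger.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


CARD_GIB = 22   # per-card memory from the table (treated as GiB here)
NUM_CARDS = 2

for name, params in [("27B", 27.0), ("31B", 31.0)]:
    total = weight_gib(params, bits_per_weight=4)  # assumption: 4-bit weights
    per_card = total / NUM_CARDS                   # assumption: even split
    print(
        f"{name}: ~{total:.1f} GiB of weights, ~{per_card:.1f} GiB per card, "
        f"~{CARD_GIB - per_card:.1f} GiB per card left for KV cache and activations"
    )
```

Under these assumptions a 27B model needs roughly 12.6 GiB for weights at 4 bits, while the same model at 16 bits would need about 50 GiB, more than the 44GB the two cards offer together. This is why the quantized formats matter for this hardware.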

## Software Stack Optimization: Core Technology Analysis

The project integrates several key optimizations:
- **Marlin quantized-weight kernels**: Optimized for the SM75 (Turing) architecture, balancing precision against memory usage;
- **FlashQLA/FlashInfer/FlashAttention2**: Improve throughput in the prefill phase;
- **TurboQuant & INT8 KV Cache**: Compress the key-value cache so that longer contexts fit in memory;
- **Native MTP Speculative Decoding**: Drafts several tokens ahead and verifies them in a single forward pass, so one decoding step can emit more than one token;
- **CUDA Graph Optimization**: Reduces CPU launch overhead and latency jitter.

## Practical Configuration: Recommended Scheme for Qwen3.6 27B

Using Qwen3.6 27B as the reference model, the project compares three KV cache precision schemes and recommends a configuration for each use case:

**KV Cache Precision Comparison**
| Feature | FP16 KV | INT8 KV | TQ4NC KV |
|---------|---------|---------|----------|
| Marlin Weight Quantization | ✅ AWQ/GPTQ | ✅ AWQ/GPTQ | ✅ AWQ/GPTQ |
| Native MTP3 Decoding | ✅ High speed at short context | ✅ Balances capacity and speed | ✅ Compressed capacity |
| Native 262K Context | ✅ Without MTP only | ⚠️ Candidate scheme | ✅ Recommended for services |
| Multimodal Image Service | ✅ Default route | 🔴 Output corrupted | ✅ Recommended for images |
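
The trade-off between the three KV cache precisions comes down to bytes per cached value. The sketch below uses the standard KV cache size formula for ordinary attention layers. The architecture numbers are hypothetical placeholders chosen for illustration, not the real Qwen3.6 27B configuration, and the 4-bit width for TQ4NC is an assumption based on its name; per-block scale overhead is ignored.

```python
def kv_cache_gib(tokens: int, layers: int, kv_heads: int,
                 head_dim: int, bits_per_value: int) -> float:
    """KV cache size in GiB: one key and one value vector per
    token, per attention layer, per KV head."""
    values = 2 * tokens * layers * kv_heads * head_dim  # 2 = keys + values
    return values * bits_per_value / 8 / 2**30


# HYPOTHETICAL architecture, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 16, 4, 256
CONTEXT = 262_144  # "262K" taken as 256 * 1024 tokens

for label, bits in [("FP16", 16), ("INT8", 8), ("4-bit (assumed TQ4NC width)", 4)]:
    size = kv_cache_gib(CONTEXT, LAYERS, KV_HEADS, HEAD_DIM, bits)
    print(f"{label}: {size:.1f} GiB for one {CONTEXT}-token sequence")
```

With these placeholder numbers a full 262K-token cache takes 16 GiB at FP16, 8 GiB at INT8, and 4 GiB at 4 bits. The absolute values depend entirely on the real model configuration, but the 2x and 4x ratios between the formats hold regardless.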

**Recommended Configurations**
1. High-quality native context: FP16 KV + 262K context (no MTP);
2. Short context, high speed: FP16 KV + 8K-16K context + MTP3;
3. High compression capacity: TQ4NC KV + 262K context + MTP3;
4. Multimodal service: TQ4NC KV + 262K context + MTP3.
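
The comparison table and the recommended configurations can be encoded as a small sanity check that flags combinations the guide marks as problematic. This is a hypothetical helper written for this summary, not part of the project; the `Profile` class, the `caveats` function, and the lowercase KV labels are illustrative names and do not correspond to actual command-line values.

```python
from dataclasses import dataclass

LONG_CONTEXT = 262_144  # "262K" taken as 256 * 1024 tokens (assumption)


@dataclass(frozen=True)
class Profile:
    kv_cache: str         # "fp16", "int8" or "tq4nc" (illustrative labels)
    context_tokens: int
    mtp_tokens: int       # 0 disables MTP, 3 means MTP3
    images: bool = False


def caveats(p: Profile) -> list[str]:
    """Return the caveats from the comparison table that apply to a profile."""
    issues = []
    if p.kv_cache == "fp16" and p.context_tokens >= LONG_CONTEXT and p.mtp_tokens > 0:
        issues.append("FP16 KV at 262K context is only listed without MTP")
    if p.kv_cache == "int8" and p.images:
        issues.append("INT8 KV corrupts output for image requests")
    return issues


if __name__ == "__main__":
    candidates = [
        Profile("fp16", LONG_CONTEXT, 0),
        Profile("fp16", 16_384, 3),
        Profile("fp16", LONG_CONTEXT, 3),
        Profile("tq4nc", LONG_CONTEXT, 3, images=True),
        Profile("int8", 16_384, 3, images=True),
    ]
    for p in candidates:
        print(p, "->", caveats(p) or "no listed caveat")
```

The four recommended configurations pass without caveats, while FP16 KV with 262K context plus MTP3, or INT8 KV with image input, are flagged.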

## Performance Test: Actual Performance of Qwen3.6 27B

Performance test results for Qwen3.6 27B:
- **Prefill**: Reaches 1,747 tok/s at a 4,096-token prompt length (about 2.3 seconds of prefill), keeping first-response latency for long documents under 3 seconds;
- **Decoding**: When generating 128 tokens, MTP3 mode exceeds 100 tok/s, fast enough for smooth streaming output;
- **MTP depth**: MTP3 is the recommended setting because it balances acceptance rate against actual throughput; MTP5 has a higher theoretical ceiling but does not pay off in practice.
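
Why a deeper draft does not automatically mean higher throughput can be illustrated with a toy model of speculative decoding. The sketch assumes each drafted token is accepted independently with the same probability, which gives the usual geometric-series estimate of tokens per verification step, and charges a fixed extra cost per drafted token. The acceptance rate and cost values below are made up for illustration and are not measurements from the project.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification step.

    Assumes each drafted token is accepted independently with
    probability accept_rate; the step always yields at least one token.
    """
    if not 0.0 <= accept_rate < 1.0:
        raise ValueError("accept_rate must be in [0, 1)")
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)


def relative_throughput(accept_rate: float, draft_len: int,
                        cost_per_draft_token: float) -> float:
    """Tokens per unit of work, where a plain decode step costs 1 and
    each drafted token adds cost_per_draft_token (a toy overhead model)."""
    step_cost = 1 + draft_len * cost_per_draft_token
    return expected_tokens_per_step(accept_rate, draft_len) / step_cost


if __name__ == "__main__":
    # Illustrative values only: 60% acceptance, 10% overhead per drafted token.
    for draft_len in (1, 3, 5):
        estimate = relative_throughput(0.6, draft_len, 0.1)
        print(f"MTP{draft_len}: {estimate:.2f}x relative to plain decoding")
```

With these made-up values the estimate is highest at 3 drafted tokens among the three depths shown: going to 5 adds overhead on every step, while the extra tokens are rarely accepted. Real acceptance rates depend on the model and the prompt.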

## Limitations & Notes

The project has the following limitations:
1. **Not a multi-tenant architecture**: It is optimized for a single concurrent request; setups with multiple agents need queue isolation in front of the server;
2. **INT8 KV image service issue**: Text requests work normally, but output is corrupted for image inputs;
3. **FP16 262K context limitation**: Genuinely long prompts are supported only in non-MTP mode; MTP3 mode is prone to OOM (Out of Memory).
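
One way to read "queue isolation" for the first limitation is a client-side gate that lets only one request reach the server at a time. The sketch below shows this with `asyncio.Lock`. It is one possible interpretation rather than the project's mechanism; `SingleSlotGate` is a hypothetical name and `fake_generate` is a stand-in for a real call to the inference server.

```python
import asyncio


class SingleSlotGate:
    """Serializes callers so the backend sees one in-flight request."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self.in_flight = 0
        self.max_in_flight = 0  # recorded only to demonstrate the guarantee

    async def run(self, request_fn, *args):
        async with self._lock:  # other callers wait here for their turn
            self.in_flight += 1
            self.max_in_flight = max(self.max_in_flight, self.in_flight)
            try:
                return await request_fn(*args)
            finally:
                self.in_flight -= 1


async def fake_generate(agent: str) -> str:
    # Stand-in for a request to the inference server (hypothetical).
    await asyncio.sleep(0.01)
    return f"reply for {agent}"


async def main():
    gate = SingleSlotGate()  # created inside the running event loop
    agents = ("agent-a", "agent-b", "agent-c")
    replies = await asyncio.gather(*(gate.run(fake_generate, a) for a in agents))
    return replies, gate.max_in_flight


if __name__ == "__main__":
    print(asyncio.run(main()))
```

All three agents submit at once, but the gate never lets more than one request run concurrently, so each request sees the single-request performance the runtime is tuned for, at the cost of waiting time for the others.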

## Summary & Recommendations: Getting More Value from Old Hardware

**Summary**
This project demonstrates the value of reusing old hardware: with the right software optimizations, the 2080 Ti, a card first announced in 2018, can run mainstream medium-scale models, with performance that matches or exceeds a newer-generation single card costing twice as much.

**Recommendations**
Developers on a limited budget can consider this setup: it requires no investment in the latest hardware and taps the potential of older cards through open-source optimization. The barrier to large model inference lies less in the hardware itself than in how fully the software stack exploits it.
