
Running a 27B Large Model Locally on Two Used RTX 2080 Ti Cards: A Practical Guide to the vLLM 2080 Ti Definitive Edition

Two modified RTX 2080 Ti 22GB graphics cards connected via NVLink and running the vLLM 2080 Ti Definitive Edition runtime cost about half as much as a used RTX 3090 Ti, and can deliver equivalent or even stronger local large model inference performance.

Tags: vLLM, RTX 2080 Ti, local large models, NVLink, Qwen, quantized inference, MTP speculative decoding, VRAM optimization, open-source LLM deployment

Published 2026-06-03 15:44 | Recent activity 2026-06-03 15:50 | Estimated read 8 min

Section 01

Introduction: Core Summary of the vLLM 2080 Ti Definitive Edition Practical Guide

Original Author & Source

Original Author/Maintainer: weicj
Source Platform: GitHub
Original Title: vLLM-2080Ti-Definitive: The definitive vLLM runtime for dual RTX 2080 Ti 22GB + NVLink
Original Link: https://github.com/weicj/vLLM-2080Ti-Definitive
Release Date: June 3, 2026

Core Points: Two modified 22GB RTX 2080 Ti graphics cards connected via NVLink and running the vLLM 2080 Ti Definitive Edition runtime cost approximately $550 used, about half the price of a used RTX 3090 Ti, and can deliver equivalent or even stronger local large model inference performance. The setup supports models such as Qwen3.6 27B and Gemma4 31B, decodes a single request at over 100 tokens/second, and natively supports a 262K context length.

Section 02

Background: New Life for Old Graphics Cards & Project Goals

NVIDIA announced the RTX 2080 Ti in August 2018; nearly eight years later, versions modified to carry 22GB of memory are still actively traded on the used market. Paired with an NVLink bridge, two of these cards have found a second life in local large model inference.

The vLLM 2080 Ti Definitive project has a clear goal: to build a dual-card 2080 Ti platform at about half the cost of a used RTX 3090 Ti, run models with 27B-31B parameters, and achieve a decoding speed of over 100 tok/s and 262K context support.

Section 03

Hardware Foundation: Competitiveness Analysis of Dual 2080 Ti

On paper, dual 2080 Ti 22GB + NVLink compares favorably with a single RTX 3090 Ti on the following hardware parameters:

Metric	Dual 2080 Ti 22GB + NVLink	RTX 3090 Ti 24GB	Ratio (dual 2080 Ti / 3090 Ti)
CUDA Cores	8,704	5,376	1.62x
SM Units	136	84	1.62x
Tensor Cores	1,088	336	3.24x
FP16 Matrix Throughput	228 TFLOPS	160 TFLOPS	1.43x
Total Memory Bandwidth	1,232 GB/s	1,008 GB/s	1.22x
Total Memory Capacity	44GB	24GB	1.83x
Used Reference Price	~$550 (including NVLink)	~$1,100	0.5x
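
To make the price argument concrete, the table's figures can be normalized per dollar. The following is a minimal sketch that only reuses numbers from the table; the dictionary and function names are illustrative, and the used prices are the article's approximate reference values rather than measured market data.

```python
# Figures copied from the comparison table; used prices are approximate.
DUAL_2080TI = {"price_usd": 550, "fp16_tflops": 228, "bandwidth_gbs": 1232, "memory_gb": 44}
RTX_3090TI = {"price_usd": 1100, "fp16_tflops": 160, "bandwidth_gbs": 1008, "memory_gb": 24}


def per_dollar(card):
    """Divide every hardware metric by the card's used price."""
    price = card["price_usd"]
    return {name: value / price for name, value in card.items() if name != "price_usd"}


def advantage(card_a, card_b):
    """Ratio of card_a's per-dollar metrics to card_b's per-dollar metrics."""
    a, b = per_dollar(card_a), per_dollar(card_b)
    return {name: a[name] / b[name] for name in a}


ADVANTAGE = advantage(DUAL_2080TI, RTX_3090TI)
for metric, ratio in ADVANTAGE.items():
    print(f"{metric}: {ratio:.2f}x per dollar")
```

Because the dual-card setup costs half as much, each per-dollar ratio is simply twice the raw ratio from the table.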

Together the two cards provide 44GB of memory, with NVLink carrying the traffic between them. That is enough to hold quantized 27B-31B models, and the combined compute resources are sufficient to run them.
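
A rough back-of-the-envelope estimate shows why quantization matters at this memory size. The sketch below assumes 4-bit weights (the article names AWQ/GPTQ but does not state the bit width), an even split of the weights across the two cards, and it ignores quantization scales, activations, and runtime overhead, so the numbers are indicative only.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_gib(params_billion, bits_per_weight):
    """Approximate size of the model weights in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / GIB


def per_gpu_headroom_gib(params_billion, bits_per_weight, gpu_gib=22, num_gpus=2):
    """Memory left on each card after an even split of the weights (assumption)."""
    return gpu_gib - weight_gib(params_billion, bits_per_weight) / num_gpus


for params in (27, 31):
    fp16 = weight_gib(params, 16)
    int4 = weight_gib(params, 4)
    room = per_gpu_headroom_gib(params, 4)
    print(f"{params}B: FP16 {fp16:.1f} GiB, 4-bit {int4:.1f} GiB, "
          f"headroom per card {room:.1f} GiB")
```

Under these assumptions, unquantized FP16 weights for a 27B model would already exceed 44GB, while 4-bit weights leave most of each card free for the KV cache.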

Section 04

Software Stack Optimization: Core Technology Analysis

The project integrates multiple key optimization technologies:

Marlin Quantization Format: Optimized for SM75 architecture, balancing precision and memory usage;
FlashQLA/FlashInfer/FlashAttention2: Improve throughput in the prefill phase;
TurboQuant & INT8 KV Cache: Compress key-value cache to support longer context;
Native MTP Speculative Decoding: Generate multiple tokens in one forward pass to accelerate decoding;
CUDA Graph Optimization: Reduce CPU overhead and lower latency jitter.
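
To illustrate the idea behind an INT8 KV cache, here is a minimal sketch of symmetric INT8 quantization applied to a single key vector. It is a generic textbook scheme written in plain Python for clarity; it is not the project's kernel and not TurboQuant, and the per-vector scale granularity is an assumption.

```python
def quantize_int8(values):
    """Symmetric INT8 quantization: returns (integer codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the signed 8-bit range used here.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floating-point values from the codes."""
    return [c * scale for c in codes]


key_vector = [0.12, -1.5, 0.75, 0.0, 2.0]
codes, scale = quantize_int8(key_vector)
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(key_vector, restored))
print(codes, f"scale={scale:.5f}", f"max_error={max_error:.5f}")
```

Each value is stored in one byte instead of two, halving cache size relative to FP16 at the cost of a rounding error bounded by half the scale.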

Section 05

Practical Configuration: Recommended Scheme for Qwen3.6 27B

Using Qwen3.6 27B as the reference model, the project provides three KV cache precision schemes and four recommended configurations:

KV Cache Precision Comparison

Feature	FP16 KV	INT8 KV	TQ4NC KV
Marlin Weight Quantization	✅ AWQ/GPTQ	✅ AWQ/GPTQ	✅ AWQ/GPTQ
Native MTP3 Decoding	✅ High speed for short context	✅ Balance between capacity and speed	✅ Compressed capacity
Native 262K Context	✅ No MTP support	⚠️ Candidate scheme	✅ Recommended for services
Multimodal Image Service	✅ Default route	🔴 Output corrupted	✅ Recommended for images

Recommended Configurations

High-quality native context: FP16 KV + 262K context (no MTP);
Short context, high speed: FP16 KV + 8K-16K context + MTP3;
High compression capacity: TQ4NC KV + 262K context + MTP3;
Multimodal service: TQ4NC KV + 262K context + MTP3.
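
The four recommendations and the two caveats from the comparison table can be captured as data plus a small check. This is only a sketch: `ServingProfile`, `PROFILES`, and `problems` are hypothetical names, not part of vLLM or the project, and it assumes that "262K" means 262,144 tokens.

```python
from dataclasses import dataclass

FULL_CONTEXT = 262_144  # assumption: "262K" is 2**18 tokens


@dataclass(frozen=True)
class ServingProfile:
    kv_cache: str            # "fp16", "int8", or "tq4nc"
    context_tokens: int
    mtp_tokens: int          # 0 means MTP is disabled
    multimodal: bool = False


# The four recommended configurations from the list above.
PROFILES = {
    "high_quality_native_context": ServingProfile("fp16", FULL_CONTEXT, 0),
    "short_context_high_speed": ServingProfile("fp16", 16_384, 3),
    "high_compression_capacity": ServingProfile("tq4nc", FULL_CONTEXT, 3),
    "multimodal_service": ServingProfile("tq4nc", FULL_CONTEXT, 3, multimodal=True),
}


def problems(profile):
    """Return the caveats from the comparison table that a profile would hit."""
    issues = []
    if (profile.kv_cache == "fp16" and profile.context_tokens >= FULL_CONTEXT
            and profile.mtp_tokens > 0):
        issues.append("FP16 KV at 262K context is listed without MTP support")
    if profile.kv_cache == "int8" and profile.multimodal:
        issues.append("INT8 KV corrupts output for image requests")
    return issues


for name, profile in PROFILES.items():
    print(name, problems(profile) or "ok")
```

Translating a profile into actual launch options depends on the project's own flags, which are not covered here.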

Section 06

Performance Test: Actual Performance of Qwen3.6 27B

Qwen3.6 27B performance test results:

Prefill: Reaches 1747 tok/s at a 4096-token prompt length, so a prompt of that size is processed in about 2.3 seconds (4096 / 1747), keeping first-response latency under 3 seconds;
Decoding: With a 128-token output, MTP3 mode exceeds 100 tok/s, which approaches a smooth streaming experience;
MTP3 is the recommended setting: it balances draft acceptance rate against actual throughput. MTP5 has a higher theoretical ceiling but delivers less benefit in practice.
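
One way to see why a longer draft gives diminishing returns is the standard speculative-decoding estimate of tokens produced per verification pass. The sketch below assumes each draft token is accepted independently with the same probability, which is a simplification, and the acceptance rate of 0.7 is a hypothetical value, not a measurement from the project.

```python
def expected_tokens_per_step(draft_tokens, acceptance_rate):
    """Expected tokens emitted per verification pass.

    Assumes every draft token is accepted independently with probability
    `acceptance_rate`; the pass always yields at least one token.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if acceptance_rate == 1.0:
        return float(draft_tokens + 1)
    return (1 - acceptance_rate ** (draft_tokens + 1)) / (1 - acceptance_rate)


ACCEPTANCE = 0.7  # hypothetical acceptance rate, for illustration only
for k in (1, 3, 5):
    tokens = expected_tokens_per_step(k, ACCEPTANCE)
    print(f"MTP{k}: about {tokens:.2f} tokens per step")
```

Under this model each additional draft token adds less than the previous one, while drafting and verifying it still costs compute and memory, which is consistent with the article's preference for MTP3 over MTP5.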

Section 07

Limitations & Notes

The project has the following limitations:

Not a multi-tenant architecture: Optimized for a single concurrent request; multiple agents need to be isolated behind a queue;
INT8 KV image service issue: Text works normally, but output is corrupted in image scenarios;
FP16 262K context limitation: Genuinely long prompts are only supported in non-MTP mode; MTP3 mode is prone to OOM (out of memory).
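
The queue isolation mentioned in the first limitation can be sketched on the client side with an asyncio lock that lets only one request reach the server at a time. Everything here is illustrative: `SingleFlightGate` is a hypothetical helper, and `fake_generate` stands in for a real call to the inference endpoint.

```python
import asyncio


class SingleFlightGate:
    """Serialize requests so the backend sees one at a time."""

    def __init__(self):
        self._lock = asyncio.Lock()
        self.active = 0
        self.peak_active = 0  # highest number of simultaneous requests observed

    async def run(self, handler, *args):
        async with self._lock:  # waiters are served in arrival order
            self.active += 1
            self.peak_active = max(self.peak_active, self.active)
            try:
                return await handler(*args)
            finally:
                self.active -= 1


async def fake_generate(prompt):
    """Placeholder for a real inference request."""
    await asyncio.sleep(0.01)
    return f"reply to {prompt}"


async def demo():
    gate = SingleFlightGate()
    jobs = [gate.run(fake_generate, f"agent-{i}") for i in range(3)]
    replies = await asyncio.gather(*jobs)
    return replies, gate.peak_active


REPLIES, PEAK = asyncio.run(demo())
print(REPLIES, "peak concurrency:", PEAK)
```

Three agents submit work at once, but the gate forwards the requests one by one, which matches the single-concurrency tuning described above.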

Section 08

Summary & Recommendations: Value Mining of Old Hardware

Summary: This project demonstrates the value of reusing old hardware. With software optimization, the 2080 Ti, a card from 2018, can run mainstream medium-scale models, with performance that the project reports as matching or exceeding a newer-generation single card at double the price.

Recommendations: Developers on a limited budget can choose this setup instead of investing in the latest hardware, and use open-source optimization to tap the potential of older cards. The barrier to large model inference depends less on owning new hardware than on how fully the software stack uses the hardware that is available.
