Zing Forum

Qwen3-VL OnDemand: On-Demand Loading Multimodal Model Proxy

A lightweight proxy service that lets vision-language models such as Qwen3-VL release VRAM when idle and load automatically on request, balancing zero idle VRAM usage with fast response.

Tags: Qwen3-VL · Multimodal · VRAM Optimization · On-Demand Loading · llama.cpp · Vision-Language Model · Proxy · GPU Resource Management
Published 2026-05-10 18:20 · Recent activity 2026-05-10 18:52 · Estimated read: 9 min

Section 01

Qwen3-VL OnDemand: Introduction to the On-Demand Loading Multimodal Model Proxy

Qwen3-VL OnDemand is a lightweight proxy service designed to solve the VRAM management problem of running multimodal vision-language models (such as Qwen3-VL) locally. Through a proxy relay architecture, it achieves zero VRAM usage when idle and automatic model loading on request, balancing fast response against GPU resource release, so that users can run multimodal models flexibly even in environments with limited VRAM.

Section 02

VRAM Dilemma of Running Multimodal Models Locally

For users running large language models locally, VRAM management is a major pain point, especially for multimodal vision-language models (VLMs) like Qwen3-VL, which typically occupy several gigabytes of VRAM once loaded. The usual options form a dilemma:

Resident VRAM Mode: The model stays loaded and responds quickly, but it holds GPU resources the whole time, leaving little room for other GPU tasks;

Manual Start/Stop Mode: Start the server before use and shut it down afterwards. This frees VRAM but is cumbersome, and every start repeats the slow model load.

The qwen3-vl-ondemand project is designed to solve this dilemma, achieving a balance of 'zero VRAM when idle and on-demand automatic loading'.

Section 03

Core Design: Proxy Relay Architecture

The project adopts a proxy relay architecture, with core components including:

vl-relay.py (Relay Proxy): A lightweight service written purely in Python, occupying only a few MB of memory. It listens on a port to receive requests, manages the backend model lifecycle, and transparently forwards requests;

llama-server (Backend Service): An inference service provided by llama.cpp that runs the Qwen3-VL model, occupying about 3.8 GB of VRAM. It starts only when requests arrive and shuts down automatically after an idle timeout.

This architecture decouples the 'service entry' from 'model inference'—the relay proxy runs at all times, while the backend service starts and stops on demand.

Section 04

Workflow: Complete Cycle from Idle to Response

The complete request processing workflow is as follows:

Idle State: The relay proxy listens on the port, llama-server is not running, and VRAM usage is 0MB;

Request Arrival: The relay proxy detects that the backend is not running, automatically starts llama-server (takes about 1.5 seconds to load), and forwards the request;

In-Service State: llama-server stays running and subsequent requests are forwarded directly, with low added latency and a generation speed of about 100 tokens per second;

Idle Timeout: If there are no new requests for longer than the configured idle time (default 5 minutes), llama-server is automatically terminated to release VRAM.

This design balances the low latency of a local model against the need to keep VRAM free during long idle periods.
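The start-on-demand and idle-timeout steps of this cycle can be sketched as a small lifecycle manager. This is a simplified illustration, not the project's code; `BackendManager` and its method names are assumptions for the example:

```python
import subprocess
import threading
import time

class BackendManager:
    """Starts a backend process on first use, tracks last-use time,
    and terminates it once it has been idle past a timeout."""

    def __init__(self, cmd, idle_timeout=300.0):
        self.cmd = cmd                  # backend command line
        self.idle_timeout = idle_timeout
        self.proc = None                # Popen handle, None when idle
        self.last_used = 0.0
        self.lock = threading.Lock()

    def ensure_running(self):
        """Start the backend if needed and refresh the last-used time.
        A real relay would also wait here until the server port accepts
        connections before forwarding (the ~1.5 s cold start)."""
        with self.lock:
            if self.proc is None or self.proc.poll() is not None:
                self.proc = subprocess.Popen(self.cmd)
            self.last_used = time.monotonic()

    def reap_if_idle(self):
        """Called periodically: stop the backend after the idle timeout,
        releasing its VRAM back to the system."""
        with self.lock:
            if (self.proc is not None and self.proc.poll() is None
                    and time.monotonic() - self.last_used > self.idle_timeout):
                self.proc.terminate()
                self.proc.wait()
                self.proc = None
```

A background timer thread calling `reap_if_idle` every few seconds, plus `ensure_running` at the top of each request handler, reproduces the idle/start/forward/timeout cycle described above.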

Section 05

Technical Highlights: Key Designs Ensuring Robustness

Key robustness design highlights in the project's engineering implementation:

  1. PDEATHSIG Mechanism: Uses Linux system calls to ensure that when the parent process (relay proxy) exits, the child process (llama-server) terminates automatically, avoiding orphan processes;

  2. Exec Startup Mode: start.sh uses exec to launch the relay proxy, replacing the shell process. When the terminal is closed, the relay proxy exits, triggering the termination of the child process;

  3. Transparent Proxy Forwarding: All HTTP methods are forwarded transparently, so the relay never needs to understand the API protocol; text, vision, and other request types all pass through unchanged;

  4. Pure Standard Library Implementation: vl-relay.py uses only Python standard libraries, with zero third-party dependencies, reducing deployment complexity and security risks.
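The PDEATHSIG mechanism in highlight 1 deserves a concrete illustration. On Linux it is set by calling prctl(2) in the child between fork and exec, which Python exposes via Popen's `preexec_fn`. A Linux-only sketch of the idea (not the project's code; `sleep 60` stands in for the real llama-server command line):

```python
import ctypes
import signal
import subprocess

PR_SET_PDEATHSIG = 1  # constant from <linux/prctl.h>
libc = ctypes.CDLL("libc.so.6", use_errno=True)

def set_pdeathsig():
    """Runs in the child between fork() and exec(): asks the kernel to
    send this process SIGTERM when its parent (the relay) dies."""
    if libc.prctl(PR_SET_PDEATHSIG, signal.SIGTERM) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_PDEATHSIG) failed")

# 'sleep 60' is a stand-in for launching llama-server.
proc = subprocess.Popen(["sleep", "60"], preexec_fn=set_pdeathsig)
```

With this in place, even if the relay crashes or is killed, the kernel delivers SIGTERM to the backend, so no orphaned llama-server keeps holding the VRAM.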

Section 06

Performance: Measured Data on Consumer-Grade Graphics Cards

On a Ryzen 7 9700X + RTX 3060 12GB configuration, measured with the Qwen3-VL-4B Q4_K_M quantized model:

Metric                        Value
Model VRAM usage              ~2.4 GB
KV cache VRAM (8K context)    ~1.2 GB
Compute buffer                ~0.3 GB
Total VRAM usage              ~3.8 GB
Cold start time               ~1.5 seconds
Text generation speed         ~100 tokens/second
Idle VRAM usage               0 MB

This solution is feasible on consumer-grade graphics cards, with acceptable cold start latency and zero VRAM usage when idle to release GPU resources.

Section 07

Comparison with Existing Solutions: Advantage Analysis

Comparison with existing solutions:

Solution               VRAM Usage                  Deployment Complexity   Flexibility
This relay solution    On-demand                   ✅ One command           Full control
Ollama (resident)      Always occupied             Simple                  Parameter-limited
Manual llama-server    Always occupied             Manual start/stop       Full control
vLLM                   Always occupied + overhead  Complex                 Production-grade

Compared to Ollama, this solution releases VRAM when idle; compared to manual management, it automates start/stop; compared to vLLM, it is easy to deploy and suitable for individuals and small teams.

Section 08

Summary and Application Scenarios

qwen3-vl-ondemand solves the VRAM management problem of local multimodal models through a proxy relay architecture, achieving zero VRAM when idle and automatic loading upon request, balancing convenience and resource release. It is suitable for:

  • Personal AI workstations (limited VRAM, need to flexibly switch GPU tasks);
  • Development and testing environments (occasional testing of multimodal functions);
  • Coexistence of multiple models (time-sharing GPU resources).

The project is compatible with mainstream AI clients and can be extended to any multimodal GGUF model supported by llama.cpp, making it a practical solution for users with limited VRAM to experience local multimodal AI.