Marmot: A Practical Tool for Accurately Estimating VRAM Required for LLM Deployment

Marmot is a VRAM estimation tool written in Rust that quickly calculates the GPU memory required to deploy large language models (LLMs) from Hugging Face or ModelScope configuration files, supporting dense, MoE, multimodal, and quantized models.

Tags: VRAM, GPU, LLM deployment, Rust, Hugging Face, quantization, MoE, Mixtral, memory estimation
Published 2026-04-17 12:31 · Recent activity 2026-04-17 12:51 · Estimated read 7 min

Section 01

Marmot: A Practical Tool for Precise VRAM Estimation in LLM Deployment

Marmot is an open-source command-line tool written in Rust that addresses the VRAM planning dilemma in LLM deployment. It quickly calculates the GPU memory required to deploy LLMs from Hugging Face/ModelScope configs, supporting Dense, MoE, multimodal, and quantized models. It answers common questions about VRAM requirements at different precisions, KV cache overhead, and how MoE models differ from dense models, filling the gap in precise pre-deployment resource planning.


Section 02

The VRAM Dilemma in LLM Deployment

In LLM deployment, GPU VRAM planning is a key but often underestimated problem. Developers routinely face questions such as: How much VRAM does a 70B-parameter model need in FP16? How much memory can INT4 quantization save? What is the KV cache overhead for a 32K context? Do MoE models like Mixtral behave differently from dense models? Traditional approaches rely on rules of thumb or generic online calculators and lack model-specific precision; Marmot addresses this gap.


Section 03

Marmot Tool Overview

Marmot is an open-source Rust tool with core capabilities:

  • Multi-source input: local configs, HTTP URLs, Hugging Face model IDs
  • Auto architecture detection: reads quantization precision and attention mechanisms
  • Full model coverage: Dense (LLaMA, Mistral, Qwen) and MoE (Mixtral, GLM-4)
  • GQA/MQA support
  • Precision comparison mode (FP16, INT8, INT4)
  • Open-source under the MIT license

Section 04

Installation & Usage Guide

Installation

  • Via Cargo: cargo install --git https://github.com/fagao-ai/marmot
  • Local build: git clone https://github.com/fagao-ai/marmot && cd marmot && cargo build --release

Basic Usage

  • Model ID: marmot meta-llama/Llama-2-7b-hf
  • Local file: marmot ./config.json
  • HTTP URL: marmot https://huggingface.co/mistralai/Mistral-7B-v0.1/raw/main/config.json

Advanced Usage

  • Context length: marmot meta-llama/Llama-2-7b-hf --context 16384
  • Precision comparison: marmot meta-llama/Llama-2-7b-hf --compare fp16,int8,int4
  • Separate KV Cache precision: marmot meta-llama/Llama-2-7b-hf --precision int4 --kv-dtype fp16

Section 05

Core Calculation Principles

Formula: vram = model_weights + kv_cache + runtime_overhead

Model Weights:

  • Dense models: Calculates embedding, attention, FFN, layer norms, and lm_head (see the sketch after this list).
  • MoE models: Separates shared components (embedding, attention) and expert FFN.
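
To make the dense case concrete, here is a minimal Rust sketch that counts parameters from the usual Hugging Face config fields (hidden_size, intermediate_size, num_hidden_layers, num_attention_heads, num_key_value_heads, vocab_size). It assumes a LLaMA-style decoder with a SwiGLU FFN and an untied lm_head; the per-component formulas are illustrative and not necessarily identical to Marmot's internals.

  // Illustrative parameter count for a LLaMA-style dense decoder.
  // Field names mirror Hugging Face config.json; the exact breakdown is an
  // assumption, not a copy of Marmot's implementation.
  struct DenseConfig {
      hidden_size: u64,
      intermediate_size: u64,
      num_hidden_layers: u64,
      num_attention_heads: u64,
      num_key_value_heads: u64, // < num_attention_heads under GQA, 1 under MQA
      vocab_size: u64,
  }

  fn dense_param_count(c: &DenseConfig) -> u64 {
      let head_dim = c.hidden_size / c.num_attention_heads;
      let embedding = c.vocab_size * c.hidden_size;
      // Attention: Q and O projections are hidden x hidden; K and V shrink with GQA/MQA.
      let attention = 2 * c.hidden_size * c.hidden_size
          + 2 * c.num_key_value_heads * head_dim * c.hidden_size;
      // SwiGLU FFN: gate, up, and down projections.
      let ffn = 3 * c.hidden_size * c.intermediate_size;
      let norms = 2 * c.hidden_size; // two RMSNorms per layer
      let per_layer = attention + ffn + norms;
      let lm_head = c.vocab_size * c.hidden_size; // assumes untied embeddings
      embedding + c.num_hidden_layers * per_layer + c.hidden_size /* final norm */ + lm_head
  }

Multiplying the resulting parameter count by the bytes-per-parameter of the chosen precision (see the table below) gives the model_weights term.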

KV Cache: kv_cache = batch × seq_len × layers × kv_heads × head_dim × 2 × bytes (2 for K/V matrices; GQA reduces kv_heads).
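
This term maps directly to code; a minimal Rust sketch of the same formula (the function and parameter names are illustrative):

  // kv_cache = batch × seq_len × layers × kv_heads × head_dim × 2 × bytes
  // The factor of 2 covers the K and V matrices; GQA/MQA shrink kv_heads.
  fn kv_cache_bytes(
      batch: u64,
      seq_len: u64,
      layers: u64,
      kv_heads: u64,
      head_dim: u64,
      bytes_per_value: f64, // e.g. 2.0 for FP16, 1.0 for INT8/FP8
  ) -> f64 {
      (batch * seq_len * layers * kv_heads * head_dim * 2) as f64 * bytes_per_value
  }

  // Example: batch 1, 4096-token context, 32 layers, 32 KV heads (no GQA),
  // head_dim 128, FP16 values -> kv_cache_bytes(1, 4096, 32, 32, 128, 2.0) ≈ 2.1 GB.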

Precision Formats:

Type   Bytes/Param   Note
FP32   4.0           Full precision
FP16   2.0           Half precision (default)
BF16   2.0           bfloat16
FP8    1.0           8-bit float
INT8   1.0           8-bit quantization
INT4   0.5           4-bit quantization
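
As a quick illustration of how the table feeds the weights term, the sketch below prints weight memory for a hypothetical 7B-parameter model at several precisions (the parameter count is a placeholder, not taken from any specific model):

  // Bytes per parameter from the table above, applied to a placeholder 7B model.
  fn main() {
      let params: f64 = 7.0e9; // hypothetical parameter count
      for (name, bytes_per_param) in [("FP32", 4.0), ("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)] {
          let gb = params * bytes_per_param / 1e9;
          println!("{name}: {gb:.1} GB of weights"); // FP16 -> 14.0 GB, INT4 -> 3.5 GB
      }
  }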

Section 06

Evidence & Validation

Output Examples:

  • Dense model: Mistral-7B with a 32K context (FP16: 17.45 GB total, weights 14.48 GB, KV cache 2.15 GB); a quick consistency check follows this list.
  • MoE model: Mixtral-8x7B (INT4 reduces VRAM by ~70% vs FP16).
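
The dense example lines up with the weights formula in Section 05: Mistral-7B has roughly 7.24 billion parameters, and 7.24 × 10⁹ parameters × 2 bytes per parameter at FP16 comes to about 14.48 GB, matching the reported weight figure.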

Supported Models:

  • Dense: LLaMA, Mistral, Qwen, GPT-2.
  • MoE: Mixtral (8x7B/8x22B), GLM-4.
  • Attention: Full, GQA, MQA.

Validation: Estimates typically fall within 5-10% of actual usage when deploying with vLLM, SGLang, Hugging Face Transformers, or TGI. Main sources of deviation: activation memory, memory allocator overhead, and framework-specific optimizations.


Section 07

Application Scenarios

Marmot is useful for:

  1. Deployment resource planning: Confirm a model fits the available VRAM and avoid OOM errors.
  2. Quantization strategy selection: Compare precisions to pick the right quality/memory trade-off.
  3. Context length trade-off: Balance model capabilities against hardware limits.
  4. Batch size optimization: Find the largest batch size that fits to improve throughput.
  5. Multi-model coexistence: Calculate total VRAM for multiple models sharing one GPU.

Section 08

Future Outlook & Summary

Future Directions:

  • Support more model architectures.
  • Add training VRAM estimation.
  • Support more quantization formats (GPTQ, AWQ, GGUF).
  • Estimate activation memory.

Summary: Marmot is a practical tool that helps developers avoid resource shortages, make informed quantization decisions, optimize hardware utilization, and reduce trial-and-error costs. It's an essential addition to any LLM deployment toolkit.