Zing Forum

Marmot: A Practical Tool for Precise VRAM Estimation in LLM Deployment

Marmot is a VRAM estimation tool written in Rust that quickly calculates the GPU memory required to deploy a large language model from a Hugging Face or ModelScope config file, with support for dense, MoE, multimodal, and quantized models.

Tags: VRAM, GPU, LLM deployment, Rust, Hugging Face, quantization, MoE, Mixtral, memory estimation
Published 2026/04/17 12:31 · Last activity 2026/04/17 12:51 · Estimated reading time 7 minutes
Section 01

Marmot: A Practical Tool for Precise VRAM Estimation in LLM Deployment

Marmot is an open-source command-line tool written in Rust that addresses the VRAM-planning dilemma in LLM deployment. It quickly calculates the GPU memory required to deploy an LLM from a Hugging Face or ModelScope config, supporting dense, MoE, multimodal, and quantized models. It answers common questions such as the VRAM requirements at different precisions, the impact of the KV cache, and how MoE models differ from dense ones, filling the gap in precise pre-deployment resource planning.

Section 02

The VRAM Dilemma in LLM Deployment

In LLM deployment, GPU memory planning is a key but underestimated problem. Developers often face questions like: How much VRAM does a 70B-parameter model need in FP16? How much memory can INT4 quantization save? What is the KV-cache overhead at a 32K context? Do MoE models like Mixtral behave differently from dense models? Traditional approaches rely on rules of thumb or online calculators and lack model-specific precision; Marmot closes this gap.

Section 03

Marmot Tool Overview

Marmot is an open-source Rust tool with core capabilities:

  • Multi-source input: local configs, HTTP URLs, Hugging Face model IDs
  • Auto architecture detection: reads quantization precision and attention mechanisms
  • Full model coverage: Dense (LLaMA, Mistral, Qwen) and MoE (Mixtral, GLM-4)
  • GQA/MQA support
  • Precision comparison mode (FP16, INT8, INT4)
  • Open-source under MIT license.

Section 04

Installation & Usage Guide

Installation

  • Via Cargo: cargo install --git https://github.com/fagao-ai/marmot
  • Local build: git clone https://github.com/fagao-ai/marmot && cd marmot && cargo build --release

Basic Usage

  • Model ID: marmot meta-llama/Llama-2-7b-hf
  • Local file: marmot ./config.json
  • HTTP URL: marmot https://huggingface.co/mistralai/Mistral-7B-v0.1/raw/main/config.json

Advanced Usage

  • Context length: marmot meta-llama/Llama-2-7b-hf --context 16384
  • Precision comparison: marmot meta-llama/Llama-2-7b-hf --compare fp16,int8,int4
  • Separate KV Cache precision: marmot meta-llama/Llama-2-7b-hf --precision int4 --kv-dtype fp16

Section 05

Core Calculation Principles

Formula: vram = model_weights + kv_cache + runtime_overhead

Model Weights:

  • Dense models: Calculates embedding, attention, FFN, layer norms, lm_head.
  • MoE models: Separates shared components (embedding, attention) and expert FFN.

KV Cache: kv_cache = batch × seq_len × layers × kv_heads × head_dim × 2 × bytes (2 for K/V matrices; GQA reduces kv_heads).
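The KV-cache formula can be checked directly with a few lines of Python. This is a minimal sketch, not Marmot's actual code; the config values (32 layers, 8 KV heads, head_dim 128) are assumptions typical of a GQA 7B-class model:

```python
def kv_cache_bytes(batch, seq_len, layers, kv_heads, head_dim, dtype_bytes):
    """KV cache = batch × seq_len × layers × kv_heads × head_dim × 2 (K and V) × bytes/param."""
    return batch * seq_len * layers * kv_heads * head_dim * 2 * dtype_bytes

# Example: assumed GQA model, FP16 KV cache (2 bytes), 4K context.
cache = kv_cache_bytes(batch=1, seq_len=4096, layers=32,
                       kv_heads=8, head_dim=128, dtype_bytes=2)
print(f"{cache / 2**30:.2f} GiB")  # 0.50 GiB
```

With full multi-head attention the same model would use kv_heads equal to the query-head count (e.g. 32), quadrupling the cache — which is why GQA matters at long contexts.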

Precision Formats:

Type   Bytes/Param   Note
FP32   4.0           Full precision
FP16   2.0           Half precision (default)
BF16   2.0           bfloat16
FP8    1.0           8-bit float
INT8   1.0           8-bit quantization
INT4   0.5           4-bit quantization
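Plugging the table into the weight term of the formula gives a quick per-precision estimate. A minimal Python sketch (the 7.24B parameter count for Mistral-7B is an assumption, chosen to match the weights figure reported in the next section):

```python
# Bytes per parameter, from the precision table above.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0,
                   "fp8": 1.0, "int8": 1.0, "int4": 0.5}

def weight_bytes(n_params: float, precision: str) -> float:
    """Weight-memory term: parameter count times bytes per parameter."""
    return n_params * BYTES_PER_PARAM[precision]

# Mistral-7B has roughly 7.24B parameters.
for p in ("fp16", "int8", "int4"):
    print(f"{p}: {weight_bytes(7.24e9, p) / 1e9:.2f} GB")
# fp16: 14.48 GB, int8: 7.24 GB, int4: 3.62 GB
```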

Section 06

Evidence & Validation

Output Examples:

  • Dense model: Mistral-7B at 32K context in FP16: 17.45 GB total (weights: 14.48 GB, KV cache: 2.15 GB).
  • MoE model: Mixtral-8x7B: INT4 reduces VRAM by ~70% vs. FP16.
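The ~70% figure can be sanity-checked with rough arithmetic: INT4 quarters the weight term while the KV cache, typically kept in FP16, stays fixed. A Python sketch with assumed numbers (the ~46.7B total-parameter count for Mixtral-8x7B and the 4 GB KV-cache/overhead term are illustrative, not Marmot output):

```python
params = 46.7e9          # assumed total parameters of Mixtral-8x7B
kv_and_overhead = 4.0e9  # assumed FP16 KV cache + runtime overhead (unchanged by weight quantization)

fp16_total = params * 2.0 + kv_and_overhead  # FP16: 2.0 bytes/param
int4_total = params * 0.5 + kv_and_overhead  # INT4: 0.5 bytes/param
saving = 1 - int4_total / fp16_total
print(f"{saving:.0%} less VRAM")  # ≈ 72% here, consistent with the ~70% claim
```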

Supported Models:

  • Dense: LLaMA, Mistral, Qwen, GPT-2.
  • MoE: Mixtral (8x7B/8x22B), GLM-4.
  • Attention: Full, GQA, MQA.

Validation: Estimates are typically within 5-10% of measured usage and are applicable to vLLM, SGLang, Hugging Face Transformers, and TGI. Sources of deviation include activation memory, allocator overhead, and framework-specific optimizations.

Section 07

Application Scenarios

Marmot is useful for:

  1. Deployment resource planning: Avoid OOM errors.
  2. Quantization strategy selection: Compare precisions to decide.
  3. Context-length trade-offs: Balance model capability against hardware limits.
  4. Batch size optimization: Improve throughput.
  5. Multi-model coexistence: Calculate total VRAM for multiple models on one GPU.

Section 08

Future Outlook & Summary

Future Directions:

  • Support more model architectures.
  • Add training VRAM estimation.
  • Support more quantization formats (GPTQ, AWQ, GGUF).
  • Estimate activation memory.

Summary: Marmot is a practical tool that helps developers avoid resource shortfalls, make informed quantization decisions, use hardware efficiently, and cut trial-and-error costs. It is a worthwhile addition to any LLM deployment toolkit.