Zing Forum


Azure GPU Virtual Machine Practice: Complete Solution for Local Deployment of 70B+ Large Models Using 4x V100

This article details how to quickly deploy a virtual machine equipped with 4 NVIDIA V100 GPUs on Azure using Terraform to enable local inference of large language models with over 70B parameters. It covers a complete practical guide from infrastructure deployment, Ollama/vLLM dual-engine comparison testing, to cost optimization and actual performance data.

Tags: Azure, GPU, V100, large model deployment, Terraform, vLLM, Ollama, local inference, Llama, Kimi
Published 2026-04-08 16:44 · Recent activity 2026-04-08 16:49 · Estimated read: 7 min

Section 01

Introduction to Azure GPU Virtual Machine Practice: Complete Solution for 70B+ Large Model Deployment with 4x V100

This article details how to quickly deploy a virtual machine equipped with 4 NVIDIA V100 GPUs on Azure using Terraform to enable local inference of large language models with over 70B parameters. It covers automated infrastructure deployment, Ollama/vLLM dual-engine comparison testing, cost optimization strategies, and actual performance data, providing developers with an efficient large model inference solution under controllable costs.


Section 02

Project Background and Core Objectives

Local deployment of large models faces pain points such as high hardware investment, complex maintenance, and insufficient flexibility. Azure NC series virtual machines provide cloud-based GPU resources, and when combined with Terraform, enable one-click deployment and on-demand start/stop. Project objectives include: automated deployment of 4x V100 virtual machines via Terraform; pre-installation of software stacks like NVIDIA drivers, CUDA, Ollama, and vLLM; provision of a benchmark testing framework to compare inference engine performance; and establishment of a reusable deployment-test-destruction process to optimize costs.


Section 03

Hardware Configuration and Architecture Design

The Azure Standard_NC24s_v3 instance is selected with the following configuration:

| Component | Specification |
| --- | --- |
| GPU | 4x NVIDIA Tesla V100 (16 GB per card, 64 GB total) |
| vCPU | 24 cores |
| Memory | 448 GB |
| System Disk | 256 GB Premium SSD |
| OS | Ubuntu 22.04 LTS Gen2 |
| Region | Central US, Zone 1 |

Theoretically, 64GB of VRAM supports 4-bit quantized 70B models. The V100 has a compute capability of 7.0, which limits the use of some new features (e.g., AWQ quantization).
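The "64 GB fits a 4-bit 70B model" claim can be sanity-checked with back-of-envelope arithmetic (illustrative numbers only; real usage adds KV cache and activation overhead that grows with context length and batch size):

```shell
# A 4-bit quantized weight occupies ~0.5 bytes, so 70B parameters
# need about 35 GB for weights alone.
weights_gb=$(awk 'BEGIN { printf "%.0f", 70e9 * 0.5 / 1e9 }')
# Of the 64 GB total across the four V100s, the remainder is the
# headroom available for KV cache, activations, and framework overhead.
headroom_gb=$(( 64 - weights_gb ))
echo "weights: ${weights_gb} GB, headroom: ${headroom_gb} GB"
```

With only ~29 GB of headroom, conservative limits like a 2048-token context make sense, which matches the vLLM flags used later in the article.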


Section 04

Detailed Deployment Process

Deployment steps:

1. Install the Azure CLI locally and authenticate; ensure Terraform ≥ 1.0 and prepare SSH keys.
2. Run the Terraform script to automatically create the resource group, virtual network, network security group, and the NC24s_v3 virtual machine.
3. cloud-init installs the NVIDIA 550 driver, CUDA 12.4, Ollama, and vLLM; initialization (including a reboot) takes approximately 15 minutes.
4. After logging in via SSH, run nvidia-smi to verify that all four GPUs are recognized.
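The deployment and verification steps above can be sketched as two shell functions. This is a minimal outline, not the repository's actual scripts; the SSH user name is an assumption:

```shell
set -eu

# Steps 1-2: authenticate, then let Terraform build everything
# (resource group, VNet, NSG, NC24s_v3 VM) in one apply.
deploy() {
  az login
  terraform init                  # fetches the azurerm provider
  terraform apply -auto-approve
}

# Step 4: once cloud-init has finished (~15 min including the reboot),
# confirm all four V100s are visible. $1 is the VM's public IP.
verify_gpus() {
  ssh "azureuser@$1" nvidia-smi -L
}
```

`verify_gpus` should list four "Tesla V100" lines; fewer usually means the driver install or reboot has not completed yet.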


Section 05

Dual-Engine Inference Comparison and Performance Testing

Ollama solution: minimalist experience; start a model with a single command (e.g., `start-ollama-model richardyoung/kat-dev-72b:Q4_K_M`). It is well suited to single-user interaction, but its error rate reaches 99% at a concurrency of 32.

vLLM solution: built on PagedAttention, it exposes an OpenAI-compatible API (`http://<public-ip>:8000`) and holds up well under high concurrency.

Performance comparison (Llama 3.3 70B, 4-bit quantization):
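Because vLLM's endpoint is OpenAI-compatible, any OpenAI-style client works against it. A minimal sketch with curl (the served model name is an assumption; query `GET /v1/models` on your server to get the real one):

```shell
# Send a chat completion request to the vLLM server at host $1.
# The "model" value below is a placeholder for whatever checkpoint
# the server was launched with.
chat() {
  curl -s "http://$1:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
          "model": "llama-3.3-70b-gptq",
          "messages": [{"role": "user", "content": "Hello"}],
          "max_tokens": 64
        }'
}
```

The same shape works with the official OpenAI SDKs by pointing their base URL at `http://<public-ip>:8000/v1`.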

| Concurrency | Ollama (tok/s) | Ollama Error Rate | vLLM (tok/s) | vLLM Error Rate | Speedup |
| --- | --- | --- | --- | --- | --- |
| 1 | 2.6 | 0% | 24.4 | 0% | 9x |
| 8 | 1.1 | 92% | 100.5 | 0% | 91x |
| 32 | 0.3 | 99% | 277.6 | 0% | 925x |

The V100 supports GPTQ 4-bit quantization and requires flags such as `--enforce-eager --max-model-len 2048 --max-num-seqs 32`.
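Putting those flags together, a launch command might look like the following. This is a sketch under stated assumptions: the model argument is a placeholder for a GPTQ 4-bit 70B checkpoint, and `--tensor-parallel-size 4` is added to shard the model across the four cards:

```shell
# Start vLLM's OpenAI-compatible server with V100-friendly settings.
# $1: model path or Hugging Face repo id (GPTQ 4-bit checkpoint assumed).
start_vllm() {
  vllm serve "$1" \
    --tensor-parallel-size 4 \
    --quantization gptq \
    --enforce-eager \
    --max-model-len 2048 \
    --max-num-seqs 32 \
    --port 8000
}
```

`--enforce-eager` disables CUDA graph capture (needed on older compute capability 7.0 hardware), while the context and batch caps keep the KV cache inside the ~29 GB of VRAM left after the weights.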


Section 06

Cost Analysis and Optimization Recommendations

The on-demand price of NC24s_v3 is approximately $10 per hour; the "deploy on demand, destroy after use" strategy is recommended. Comparison with A100 instances (Standard_NC24ads_A100_v4, $3.67 per hour):

| Configuration | tok/s at Concurrency 32 | Hourly Cost | tok/s per Dollar |
| --- | --- | --- | --- |
| Qwen3-Coder-30B + A100 | 1924 | $3.67 | 524 |
| Llama3.3 70B GPTQ + 4x V100 | 278 | $10.00 | 28 |

By this measure, the A100 solution is roughly 19 times more cost-efficient than the 4x V100 setup, and it also supports newer features such as AWQ quantization and FlashAttention-2.
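The cost-efficiency figures follow directly from the table: throughput divided by hourly price, then the ratio of the two results:

```shell
# tok/s per dollar for each configuration (numbers from the table above).
a100=$(awk 'BEGIN { printf "%.0f", 1924 / 3.67 }')            # A100
v100=$(awk 'BEGIN { printf "%.0f", 278 / 10.00 }')            # 4x V100
# Efficiency ratio between the two setups, rounded to the nearest integer.
ratio=$(awk 'BEGIN { printf "%.0f", (1924 / 3.67) / (278 / 10.0) }')
echo "A100: ${a100} tok/s/\$, 4xV100: ${v100} tok/s/\$, ratio: ${ratio}x"
```

Note the comparison runs different models on each card (a 30B MoE vs. a dense 70B), so the 19x figure reflects the whole stack, not the GPUs alone.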


Section 07

Recommended Models and Best Practices

Recommended Models:

  • Kimi-Dev-72B: achieves 46.8% on SWE-bench; excels at code editing;
  • Qwen3-Coder-30B: MoE architecture with 3.3B active parameters; 64.6% on SWE-bench; can run on a single V100;
  • Llama 3.3 70B: strong general-purpose capabilities, close to GPT-4o level;
  • DeepSeek-V3.2 70B distilled version: strong tool-calling capabilities; MIT license (commercial-friendly).

Best practices: manage the infrastructure with Terraform; prefer vLLM in production environments; check VRAM usage regularly; use destroy.sh to clean up all resources in one step.

Section 08

Summary and Outlook

This project provides a complete cloud-based large model deployment solution from infrastructure to performance testing, clarifying the optimal choice of Ollama/vLLM for different scenarios. The 4x V100 solution provides an entry point for 70B+ model inference for users with limited budgets, while the A100 upgrade path meets higher efficiency needs. With the development of the open-source model ecosystem, such deployment tools will further lower the threshold for large model applications.