Zing Forum


Azure GPU Virtual Machine Practice: Complete Solution for Local Deployment of 70B+ Large Models Using 4x V100

This article details how to quickly deploy a virtual machine equipped with 4 NVIDIA V100 GPUs on Azure using Terraform to enable local inference of large language models with over 70B parameters. It covers a complete practical guide from infrastructure deployment, Ollama/vLLM dual-engine comparison testing, to cost optimization and actual performance data.

Tags: Azure, GPU, V100, large model deployment, Terraform, vLLM, Ollama, local inference, Llama, Kimi
Published 2026-04-08 16:44 · Recent activity 2026-04-08 16:49 · Estimated read: 7 min

Section 01

Introduction to Azure GPU Virtual Machine Practice: Complete Solution for 70B+ Large Model Deployment with 4x V100

This article details how to quickly deploy a virtual machine equipped with 4 NVIDIA V100 GPUs on Azure using Terraform to enable local inference of large language models with over 70B parameters. It covers automated infrastructure deployment, Ollama/vLLM dual-engine comparison testing, cost optimization strategies, and actual performance data, providing developers with an efficient large model inference solution under controllable costs.


Section 02

Project Background and Core Objectives

Local deployment of large models faces pain points such as high hardware investment, complex maintenance, and insufficient flexibility. Azure NC series virtual machines provide cloud-based GPU resources, and when combined with Terraform, enable one-click deployment and on-demand start/stop. Project objectives include: automated deployment of 4x V100 virtual machines via Terraform; pre-installation of software stacks like NVIDIA drivers, CUDA, Ollama, and vLLM; provision of a benchmark testing framework to compare inference engine performance; and establishment of a reusable deployment-test-destruction process to optimize costs.


Section 03

Hardware Configuration and Architecture Design

The Azure Standard_NC24s_v3 instance is selected with the following configuration:

| Component | Specification |
| --- | --- |
| GPU | 4x NVIDIA Tesla V100 (16 GB per card, 64 GB total) |
| vCPU | 24 cores |
| Memory | 448 GB |
| System Disk | 256 GB Premium SSD |
| OS | Ubuntu 22.04 LTS Gen2 |
| Region | Central US, Zone 1 |

Theoretically, 64GB of VRAM supports 4-bit quantized 70B models. The V100 has a compute capability of 7.0, which limits the use of some new features (e.g., AWQ quantization).
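The "64 GB fits a 4-bit 70B model" claim can be sanity-checked with back-of-envelope arithmetic (illustrative numbers only; real usage adds KV cache and activation overhead that grows with context length and batch size):

```shell
# A 4-bit quantized weight occupies ~0.5 bytes, so 70B parameters
# need about 35 GB for weights alone.
weights_gb=$(awk 'BEGIN { printf "%.0f", 70e9 * 0.5 / 1e9 }')
# Of the 64 GB total across the four V100s, the remainder is the
# headroom available for KV cache, activations, and framework overhead.
headroom_gb=$(( 64 - weights_gb ))
echo "weights: ${weights_gb} GB, headroom: ${headroom_gb} GB"
```

With only ~29 GB of headroom, conservative limits like a 2048-token context make sense, which matches the vLLM flags used later in the article.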


Section 04

Detailed Deployment Process

Deployment steps:

1. Install the Azure CLI locally and authenticate; ensure Terraform ≥ 1.0 and prepare SSH keys.
2. Run the Terraform script to automatically create the resource group, virtual network, network security group, and the NC24s_v3 virtual machine.
3. cloud-init installs the NVIDIA 550 driver, CUDA 12.4, Ollama, and vLLM; initialization (including a reboot) takes approximately 15 minutes.
4. After logging in via SSH, run nvidia-smi to verify that all four GPUs are recognized.
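The deployment and verification steps above can be sketched as two shell functions. This is a minimal outline, not the repository's actual scripts; the SSH user name is an assumption:

```shell
set -eu

# Steps 1-2: authenticate, then let Terraform build everything
# (resource group, VNet, NSG, NC24s_v3 VM) in one apply.
deploy() {
  az login
  terraform init                  # fetches the azurerm provider
  terraform apply -auto-approve
}

# Step 4: once cloud-init has finished (~15 min including the reboot),
# confirm all four V100s are visible. $1 is the VM's public IP.
verify_gpus() {
  ssh "azureuser@$1" nvidia-smi -L
}
```

`verify_gpus` should list four "Tesla V100" lines; fewer usually means the driver install or reboot has not completed yet.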


Section 05

Dual-Engine Inference Comparison and Performance Testing

Ollama solution: minimalist experience; start a model with a single command (e.g., `start-ollama-model richardyoung/kat-dev-72b:Q4_K_M`). It is well suited to single-user interaction, but its error rate reaches 99% at a concurrency of 32.

vLLM solution: built on PagedAttention, it exposes an OpenAI-compatible API (`http://<public-ip>:8000`) and holds up well under high concurrency.

Performance comparison (Llama 3.3 70B, 4-bit quantization):
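Because vLLM's endpoint is OpenAI-compatible, any OpenAI-style client works against it. A minimal sketch with curl (the served model name is an assumption; query `GET /v1/models` on your server to get the real one):

```shell
# Send a chat completion request to the vLLM server at host $1.
# The "model" value below is a placeholder for whatever checkpoint
# the server was launched with.
chat() {
  curl -s "http://$1:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
          "model": "llama-3.3-70b-gptq",
          "messages": [{"role": "user", "content": "Hello"}],
          "max_tokens": 64
        }'
}
```

The same shape works with the official OpenAI SDKs by pointing their base URL at `http://<public-ip>:8000/v1`.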

| Concurrency | Ollama (tok/s) | Ollama Error Rate | vLLM (tok/s) | vLLM Error Rate | Speedup |
| --- | --- | --- | --- | --- | --- |
| 1 | 2.6 | 0% | 24.4 | 0% | 9x |
| 8 | 1.1 | 92% | 100.5 | 0% | 91x |
| 32 | 0.3 | 99% | 277.6 | 0% | 925x |

The V100 supports GPTQ 4-bit quantization and requires flags such as `--enforce-eager --max-model-len 2048 --max-num-seqs 32`.
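Putting those flags together, a launch command might look like the following. This is a sketch under stated assumptions: the model argument is a placeholder for a GPTQ 4-bit 70B checkpoint, and `--tensor-parallel-size 4` is added to shard the model across the four cards:

```shell
# Start vLLM's OpenAI-compatible server with V100-friendly settings.
# $1: model path or Hugging Face repo id (GPTQ 4-bit checkpoint assumed).
start_vllm() {
  vllm serve "$1" \
    --tensor-parallel-size 4 \
    --quantization gptq \
    --enforce-eager \
    --max-model-len 2048 \
    --max-num-seqs 32 \
    --port 8000
}
```

`--enforce-eager` disables CUDA graph capture (needed on older compute capability 7.0 hardware), while the context and batch caps keep the KV cache inside the ~29 GB of VRAM left after the weights.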


Section 06

Cost Analysis and Optimization Recommendations

The on-demand price of NC24s_v3 is approximately $10 per hour; the "deploy on demand, destroy after use" strategy is recommended. Comparison with A100 instances (Standard_NC24ads_A100_v4, $3.67 per hour):

| Configuration | tok/s at Concurrency 32 | Hourly Cost | tok/s per Dollar |
| --- | --- | --- | --- |
| Qwen3-Coder-30B + A100 | 1924 | $3.67 | 524 |
| Llama3.3 70B GPTQ + 4x V100 | 278 | $10.00 | 28 |

By this measure, the A100 solution is roughly 19 times more cost-efficient than the 4x V100 setup, and it also supports newer features such as AWQ quantization and FlashAttention-2.
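The cost-efficiency figures follow directly from the table: throughput divided by hourly price, then the ratio of the two results:

```shell
# tok/s per dollar for each configuration (numbers from the table above).
a100=$(awk 'BEGIN { printf "%.0f", 1924 / 3.67 }')            # A100
v100=$(awk 'BEGIN { printf "%.0f", 278 / 10.00 }')            # 4x V100
# Efficiency ratio between the two setups, rounded to the nearest integer.
ratio=$(awk 'BEGIN { printf "%.0f", (1924 / 3.67) / (278 / 10.0) }')
echo "A100: ${a100} tok/s/\$, 4xV100: ${v100} tok/s/\$, ratio: ${ratio}x"
```

Note the comparison runs different models on each card (a 30B MoE vs. a dense 70B), so the 19x figure reflects the whole stack, not the GPUs alone.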


Section 07

Recommended Models and Best Practices

Recommended Models:

  • Kimi-Dev-72B: achieves 46.8% on SWE-bench; excels at code editing;
  • Qwen3-Coder-30B: MoE architecture with 3.3B active parameters; 64.6% on SWE-bench; can run on a single V100;
  • Llama 3.3 70B: strong general-purpose capabilities, close to GPT-4o level;
  • DeepSeek-V3.2 70B distilled version: strong tool-calling capabilities; MIT license (commercial-friendly).

Best practices: manage the infrastructure with Terraform; prefer vLLM in production environments; check VRAM usage regularly; use destroy.sh to clean up all resources in one step.

Section 08

Summary and Outlook

This project provides a complete cloud-based large model deployment solution from infrastructure to performance testing, clarifying the optimal choice of Ollama/vLLM for different scenarios. The 4x V100 solution provides an entry point for 70B+ model inference for users with limited budgets, while the A100 upgrade path meets higher efficiency needs. With the development of the open-source model ecosystem, such deployment tools will further lower the threshold for large model applications.