# Azure GPU Virtual Machine Practice: Complete Solution for Local Deployment of 70B+ Large Models Using 4x V100

> This article details how to quickly deploy a virtual machine equipped with 4 NVIDIA V100 GPUs on Azure using Terraform to enable local inference of large language models with over 70B parameters. It covers a complete practical guide from infrastructure deployment, Ollama/vLLM dual-engine comparison testing, to cost optimization and actual performance data.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-08T08:44:22.000Z
- Last activity: 2026-04-08T08:49:08.480Z
- Popularity: 167.9
- Keywords: Azure, GPU, V100, large model deployment, Terraform, vLLM, Ollama, local inference, Llama, Kimi, quantization, cost optimization
- Page link: https://www.zingnex.cn/en/forum/thread/azure-gpu-4x-v100-70b
- Canonical: https://www.zingnex.cn/forum/thread/azure-gpu-4x-v100-70b

---

## Introduction

This article details how to quickly deploy a virtual machine equipped with 4 NVIDIA V100 GPUs on Azure using Terraform to enable local inference of large language models with over 70B parameters. It covers automated infrastructure deployment, Ollama/vLLM dual-engine comparison testing, cost optimization strategies, and actual performance data, providing developers with an efficient large model inference solution under controllable costs.

## Project Background and Core Objectives

Running large models on owned hardware means high upfront investment, complex maintenance, and little flexibility. Azure NC-series virtual machines provide cloud GPU resources, and combining them with Terraform enables one-command deployment and on-demand start/stop. The project has four objectives:

- Automate deployment of a 4x V100 virtual machine with Terraform.
- Pre-install the software stack: NVIDIA drivers, CUDA, Ollama, and vLLM.
- Provide a benchmark framework for comparing inference engine performance.
- Establish a reusable deploy-test-destroy workflow to keep costs down.
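The deploy-test-destroy workflow described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual scripts: `run_cycle` and its `benchmark` callback are hypothetical names, and the sketch assumes it is run from a directory containing the Terraform configuration. The point it shows is that the destroy step sits in a `finally` block, so a failed benchmark does not leave a billed VM running.

```python
import subprocess
from typing import Callable, Sequence

# Terraform CLI calls for one cycle; -auto-approve skips the confirmation prompt.
DEPLOY_CMDS = [
    ["terraform", "init"],
    ["terraform", "apply", "-auto-approve"],
]
DESTROY_CMD = ["terraform", "destroy", "-auto-approve"]


def _run_checked(cmd: Sequence[str]) -> None:
    """Run a command and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(list(cmd), check=True)


def run_cycle(benchmark: Callable[[], None],
              run: Callable[[Sequence[str]], None] = _run_checked) -> None:
    """Deploy, run the benchmark, then destroy, even if an earlier step fails."""
    try:
        for cmd in DEPLOY_CMDS:
            run(cmd)
        benchmark()
    finally:
        # Always tear down so a failed run does not keep accruing cost.
        run(DESTROY_CMD)
```

Passing a custom `run` callable (for example one that only logs the commands) makes it possible to dry-run the sequence without touching Azure.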

## Hardware Configuration and Architecture Design

The Azure Standard_NC24s_v3 instance is selected, with the following configuration:

|Component|Specifications|
|---|---|
|GPU|4x NVIDIA Tesla V100 (16GB per card, total 64GB)|
|vCPU|24 cores|
|Memory|448GB|
|System Disk|256GB Premium SSD|
|OS|Ubuntu 22.04 LTS Gen2|
|Region|Central US, Zone 1|

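In theory, 64GB of VRAM is enough for a 4-bit quantized 70B model: the weights take roughly 35GB (70B parameters × 0.5 bytes each), leaving the remainder for the KV cache and activations. The V100 has a compute capability of 7.0, which rules out some newer features (e.g., AWQ quantization).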
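The VRAM claim can be checked with a back-of-the-envelope calculation. The sketch below counts weight memory only; it deliberately ignores quantization metadata (scales and zero points), the KV cache, and activations, so real usage will be higher. The function names are illustrative.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Memory needed for the model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / GIB


def headroom_gib(n_params: float, bits_per_weight: float,
                 n_gpus: int = 4, gib_per_gpu: float = 16.0) -> float:
    """VRAM left after loading the weights (negative means they do not fit)."""
    return n_gpus * gib_per_gpu - weight_gib(n_params, bits_per_weight)


for bits in (16, 8, 4):
    print(f"70B at {bits}-bit: {weight_gib(70e9, bits):.1f} GiB weights, "
          f"{headroom_gib(70e9, bits):.1f} GiB headroom on 4x 16 GiB")
```

At 4 bits the weights come to about 32.6 GiB, while at 16 bits they would need about 130 GiB, which is why quantization is required on this instance.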

## Detailed Deployment Process

Deployment steps:

1. Install the Azure CLI locally and authenticate; ensure Terraform ≥ 1.0 is installed and prepare SSH keys.
2. Run the Terraform script to automatically create the resource group, virtual network, security group, and NC24s_v3 virtual machine.
3. cloud-init installs the NVIDIA 550 driver, CUDA 12.4, Ollama, and vLLM; initialization (including a reboot) takes approximately 15 minutes.
4. After logging in over SSH, run `nvidia-smi` to verify that the GPUs are recognized.
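The final verification step can also be scripted. The sketch below parses the CSV output of `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits` and checks that four GPUs with roughly 16GB each are present. The 16000 MiB threshold and the helper names are assumptions made for illustration, since the exact value reported can vary with the driver.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpus(csv_text: str) -> list[tuple[str, int]]:
    """Turn 'name, memory_mib' lines into (name, memory_mib) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma so a comma inside the name would not break parsing.
        name, mem_mib = (field.strip() for field in line.rsplit(",", 1))
        gpus.append((name, int(mem_mib)))
    return gpus


def check_gpus(csv_text: str, expected_count: int = 4,
               min_mib: int = 16000) -> bool:
    """True if the expected number of GPUs is present with enough memory each."""
    gpus = parse_gpus(csv_text)
    return len(gpus) == expected_count and all(m >= min_mib for _, m in gpus)


def query_gpus() -> str:
    """Run nvidia-smi on the VM and return its raw CSV output."""
    return subprocess.run(QUERY, capture_output=True, text=True,
                          check=True).stdout
```

On the VM, `check_gpus(query_gpus())` would return `False` if the driver installation is incomplete and fewer than four GPUs are visible.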

## Dual-Engine Inference Comparison and Performance Testing

**Ollama Solution**: Minimal setup; a single command starts the model (e.g., `start-ollama-model richardyoung/kat-dev-72b:Q4_K_M`). It suits single-user interaction, but the error rate reaches 99% at a concurrency of 32.

**vLLM Solution**: Uses PagedAttention, exposes an OpenAI-compatible API (`http://<public-ip>:8000`), and holds a 0% error rate up to a concurrency of 32.

Performance comparison (Llama3.3 70B, 4-bit quantization):

|Concurrency|Ollama (tok/s)|Ollama Error Rate|vLLM (tok/s)|vLLM Error Rate|Speedup (vLLM vs. Ollama)|
|---|---|---|---|---|---|
|1|2.6|0%|24.4|0%|9x|
|8|1.1|92%|100.5|0%|91x|
|32|0.3|99%|277.6|0%|925x|
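Figures like those in the table could be collected with a small load harness along the following lines. This is a sketch under stated assumptions rather than the project's benchmark framework: `send` is a hypothetical callable that issues one request to the engine under test and returns the number of generated tokens, raising an exception on failure. Whether the table's tok/s counts only successful requests is not stated in the source; this sketch does.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


def run_load(send: Callable[[str], int], prompts: list[str],
             concurrency: int) -> dict:
    """Send all prompts using `concurrency` worker threads and summarize."""
    if not prompts:
        raise ValueError("prompts must not be empty")

    def attempt(prompt: str) -> Optional[int]:
        try:
            return send(prompt)
        except Exception:
            return None  # any failure counts as one errored request

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(attempt, prompts))
    elapsed = time.perf_counter() - start

    ok = [r for r in results if r is not None]
    return {
        "requests": len(results),
        "error_rate": 1 - len(ok) / len(results),
        "tokens": sum(ok),
        # Aggregate throughput: successful tokens over wall-clock time.
        "tok_per_s": sum(ok) / elapsed if elapsed > 0 else 0.0,
    }
```

Running the same prompt set at concurrency 1, 8, and 32 against each engine would yield one row per concurrency level, with the speedup being the ratio of the two `tok_per_s` values.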

On the V100, vLLM can run GPTQ 4-bit quantized models, but it needs parameters such as `--enforce-eager --max-model-len 2048 --max-num-seqs 32`.
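A launch command using these parameters might be assembled as in the sketch below. Only `--enforce-eager`, `--max-model-len`, and `--max-num-seqs` come from the text; the other flags (`--quantization gptq`, `--dtype half`, `--tensor-parallel-size 4`, `--port 8000`) are assumptions about a typical 4-GPU V100 setup, and the model identifier is a placeholder.

```python
import shlex


def vllm_serve_args(model: str, tensor_parallel: int = 4,
                    max_model_len: int = 2048,
                    max_num_seqs: int = 32) -> list[str]:
    """Build an argument list for `vllm serve` on a 4x V100 machine."""
    return [
        "vllm", "serve", model,
        "--quantization", "gptq",       # assumed: GPTQ 4-bit checkpoint
        "--dtype", "half",              # assumed: FP16, as V100 lacks BF16
        "--tensor-parallel-size", str(tensor_parallel),  # assumed: shard over 4 GPUs
        "--enforce-eager",              # from the text
        "--max-model-len", str(max_model_len),           # from the text
        "--max-num-seqs", str(max_num_seqs),             # from the text
        "--port", "8000",               # assumed: matches the API URL above
    ]


# Placeholder model id; substitute the GPTQ checkpoint actually deployed.
print(shlex.join(vllm_serve_args("<gptq-model-id>")))
```

Capping the context length and the number of concurrent sequences trades capability for memory, which matters when roughly half of the 64GB is already taken by weights.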

## Cost Analysis and Optimization Recommendations

The on-demand price of NC24s_v3 is approximately $10 per hour, so a "deploy on demand, destroy after use" strategy is recommended. Comparison with an A100 instance (Standard_NC24ads_A100_v4, $3.67 per hour):

|Configuration|Throughput at Concurrency 32 (tok/s)|Hourly Cost|Throughput per Hourly Dollar (tok/s per $/h)|
|---|---|---|---|
|Qwen3-Coder-30B+A100|1924|$3.67|524|
|Llama3.3 70B GPTQ+4xV100|278|$10.00|28|

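By this measure, the A100 setup is about 19 times more cost-efficient than the 4x V100 setup (524 vs. 28), and it supports advanced features such as AWQ and FlashAttention2. Note that the two rows run different models, so the figure compares deployments rather than hardware alone.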
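The last column of the table is throughput divided by hourly price, which the following sketch reproduces and converts into more familiar units. All inputs are the figures quoted above; the function names are illustrative.

```python
def table_metric(tok_per_s: float, usd_per_hour: float) -> float:
    """The table's last column: tok/s divided by hourly price."""
    return tok_per_s / usd_per_hour


def tokens_per_dollar(tok_per_s: float, usd_per_hour: float) -> float:
    """Tokens generated per dollar at sustained throughput."""
    return tok_per_s * 3600 / usd_per_hour


def usd_per_million_tokens(tok_per_s: float, usd_per_hour: float) -> float:
    """Cost of generating one million tokens at sustained throughput."""
    return 1e6 / tokens_per_dollar(tok_per_s, usd_per_hour)


configs = {
    "Qwen3-Coder-30B + A100": (1924, 3.67),
    "Llama3.3 70B GPTQ + 4x V100": (278, 10.00),
}
for name, (tps, price) in configs.items():
    print(f"{name}: metric={table_metric(tps, price):.0f}, "
          f"${usd_per_million_tokens(tps, price):.2f} per 1M tokens")

ratio = table_metric(1924, 3.67) / table_metric(278, 10.00)
print(f"A100 vs. 4x V100 cost efficiency: {ratio:.1f}x")
```

Expressed per million tokens, the gap is roughly $0.53 on the A100 setup versus roughly $9.99 on the 4x V100 setup, assuming the instance is kept fully loaded at a concurrency of 32.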

## Recommended Models and Best Practices

**Recommended Models**:

- Kimi-Dev-72B: Achieves 46.8% on SWE-bench, excels at code editing;
- Qwen3-Coder-30B: MoE architecture, 3.3B active parameters, 64.6% on SWE-bench, can run on a single V100;
- Llama3.3 70B: Strong general-purpose capabilities, close to GPT-4o level;
- DeepSeek-V3.2 70B Distilled Version: Strong tool calling capabilities, MIT license (commercial-friendly).

**Best Practices**: Use Terraform to manage infrastructure; prefer vLLM in production environments; check VRAM usage regularly; use `destroy.sh` to clean up all resources in one step.

## Summary and Outlook

This project provides a complete cloud-based large model deployment solution from infrastructure to performance testing, clarifying the optimal choice of Ollama/vLLM for different scenarios. The 4x V100 solution provides an entry point for 70B+ model inference for users with limited budgets, while the A100 upgrade path meets higher efficiency needs. With the development of the open-source model ecosystem, such deployment tools will further lower the threshold for large model applications.
