# Using Terraform to Deploy vLLM on GKE: Enterprise-Grade Large Language Model Inference Infrastructure Practice

> This article introduces how to deploy vLLM on Google Kubernetes Engine (GKE) using Terraform, enabling automated model downloading, GPU auto-scaling, and secure Hugging Face Token management, providing enterprises with scalable, high-performance LLM inference infrastructure.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-25T22:45:02.000Z
- Last activity: 2026-05-25T22:49:10.324Z
- Popularity: 161.9
- Keywords: vLLM, GKE, Terraform, LLM inference, GPU, Kubernetes, Infrastructure as Code, speculative decoding, cost optimization
- Page link: https://www.zingnex.cn/en/forum/thread/terraform-gke-vllm
- Canonical: https://www.zingnex.cn/forum/thread/terraform-gke-vllm

---

## Introduction: Enterprise-Grade LLM Inference Practice with Terraform-Deployed vLLM on GKE

This article introduces how to deploy vLLM inference infrastructure on Google Kubernetes Engine (GKE) using Terraform, enabling automated model downloading, GPU auto-scaling, and secure Hugging Face Token management. The solution uses Infrastructure as Code (IaC) to standardize the deployment process, balances cost optimization (Spot instances) against stability (On-Demand instances), and provides flexible configuration options. The sections below cover the background, architecture, security practices, deployment steps, and operations considerations.

## Background: Why Do We Need Terraform-Based LLM Inference Infrastructure?

As large language models move into production, enterprises face the challenge of deploying inference services efficiently, securely, and at scale. vLLM on Kubernetes is a mainstream solution, but manually configuring GPU node pools, storage volumes, model downloads, and related steps is cumbersome and error-prone. Terraform, a widely used IaC tool, makes the deployment process standardized, versioned, and reproducible, which addresses these problems.

## Architecture Design: Dual Node Pools and Cost Optimization Strategy

The core design of this Terraform module is a dual node pool architecture:
- **Spot Node Pool**: Runs on spare GCP capacity at prices 60-91% lower than standard instances, which suits batch processing or non-critical tasks (Note: Spot nodes may be reclaimed at any time, so this pool alone is not suitable where high availability is required);
- **On-Demand Node Pool**: Serves as a fallback so the service keeps running when Spot capacity is insufficient;
- **Persistent Storage (PVC)**: Caches Hugging Face models. After the first download, subsequent Pods can mount directly, reducing cold start time.
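
The three pieces above could be declared roughly as follows. This is a minimal sketch, not the module's actual code: the resource names, the `cluster_name` and `location` variables, the autoscaling limits, the PVC size, and the storage class are all assumptions chosen for illustration.

```hcl
# Hypothetical inputs; the real module may name these differently.
variable "cluster_name" {
  type = string
}

variable "location" {
  type = string
}

# Spot pool: cheaper, but nodes can be reclaimed by GCP.
resource "google_container_node_pool" "gpu_spot" {
  name     = "gpu-spot"
  cluster  = var.cluster_name
  location = var.location

  autoscaling {
    min_node_count = 0
    max_node_count = 2 # assumed limit
  }

  node_config {
    machine_type = "g2-standard-48"
    spot         = true

    guest_accelerator {
      type  = "nvidia-l4"
      count = 4
    }
  }
}

# On-Demand pool: same shape without `spot`, used as the fallback.
resource "google_container_node_pool" "gpu_on_demand" {
  name     = "gpu-on-demand"
  cluster  = var.cluster_name
  location = var.location

  autoscaling {
    min_node_count = 0
    max_node_count = 1 # assumed limit
  }

  node_config {
    machine_type = "g2-standard-48"

    guest_accelerator {
      type  = "nvidia-l4"
      count = 4
    }
  }
}

# PVC that caches downloaded model weights (size and class are assumptions).
resource "kubernetes_persistent_volume_claim_v1" "model_cache" {
  metadata {
    name = "model-cache"
  }

  spec {
    access_modes       = ["ReadWriteOnce"]
    storage_class_name = "premium-rwo"

    resources {
      requests = {
        storage = "200Gi"
      }
    }
  }

  # Do not block `terraform apply` waiting for a consumer Pod to bind the volume.
  wait_until_bound = false
}
```

Note that a `ReadWriteOnce` volume can only be mounted by Pods on one node at a time; sharing the cache across nodes would need a different access mode or storage backend.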

## Security Practices: Token Management and Access Control

Multi-layer strategies are adopted for security:
1. **Environment Variable Injection**: The Hugging Face Token is supplied through the `TF_VAR_hf_token` environment variable, which Terraform maps to the `hf_token` input variable, so the token is never hardcoded;
2. **Kubernetes Secrets**: The token is stored as a Secret, accessible only to authorized Pods;
3. **Internal Service Isolation**: The vLLM service is exposed as an internal K8s Service by default, without a public IP. Access from outside the cluster requires `kubectl port-forward`, which reduces the attack surface.
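
A sketch of how these three layers might look in Terraform is shown below. Only the `hf_token` variable name comes from the article; the Secret name, the `app = "vllm"` selector label, and port 8000 (vLLM's default serving port) are assumptions.

```hcl
# Populated from the TF_VAR_hf_token environment variable at plan/apply time.
variable "hf_token" {
  type      = string
  sensitive = true # keeps the value out of plan and apply output
}

# The provider base64-encodes `data` values when creating the Secret.
resource "kubernetes_secret_v1" "hf_token" {
  metadata {
    name = "hf-token" # hypothetical name
  }

  type = "Opaque"

  data = {
    token = var.hf_token
  }
}

# ClusterIP keeps the endpoint internal: no external IP is allocated.
resource "kubernetes_service_v1" "vllm" {
  metadata {
    name = "vllm" # hypothetical name
  }

  spec {
    type = "ClusterIP"

    selector = {
      app = "vllm" # assumed Pod label
    }

    port {
      port        = 8000
      target_port = 8000
    }
  }
}
```

Marking the variable `sensitive` only hides it from CLI output; the value is still written to the Terraform state, so access to the state bucket should be restricted accordingly.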

## Deployment Process and Key Features

### Deployment Process
Prerequisites: Enable billing and the Kubernetes Engine API for the GCP project; install the gcloud CLI, Terraform (v1.5+), and kubectl; create a GCS bucket to store the Terraform state files.
Core steps: Clone the repository → Set environment variables → Terraform initialization and deployment.
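
The Terraform version and state bucket prerequisites typically translate into a `terraform` block like the one below. This is an illustrative sketch: the bucket name and state prefix are placeholders, and the repository may pin providers differently.

```hcl
terraform {
  # Matches the "Terraform v1.5+" prerequisite.
  required_version = ">= 1.5.0"

  # Remote state in the pre-created GCS bucket.
  backend "gcs" {
    bucket = "my-terraform-state-bucket" # placeholder: use your own bucket
    prefix = "vllm-gke"                  # placeholder state path
  }

  required_providers {
    google = {
      source = "hashicorp/google"
    }
    kubernetes = {
      source = "hashicorp/kubernetes"
    }
  }
}
```

With this in place and `TF_VAR_hf_token` exported in the shell, `terraform init` configures the backend and `terraform apply` performs the deployment.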
### Key Features
- **Model Download Job**: An independent Job completes model downloading, avoiding cold start delays for inference Pods;
- **Speculative Decoding**: Uses a draft model to quickly generate candidate tokens, which are verified by the main model to improve inference throughput (e.g., Qwen3-32B paired with the Zhihu-ai/Zhi-Create-Qwen3-32B-Eagle3 draft model).
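
One way to express the model download Job in Terraform is sketched below. It is not the module's actual definition: the container image, the Secret name `hf-token` with key `token`, the PVC name `model-cache`, and the timeout are assumptions, and the Secret and PVC are presumed to exist already.

```hcl
variable "model_id" {
  type = string # e.g. "Qwen/Qwen3-32B"
}

resource "kubernetes_job_v1" "model_download" {
  metadata {
    name = "model-download"
  }

  spec {
    backoff_limit = 2

    template {
      metadata {}

      spec {
        restart_policy = "Never"

        container {
          name    = "download"
          image   = "python:3.11-slim" # assumed image
          command = ["/bin/sh", "-c"]
          args = [
            "pip install --no-cache-dir huggingface_hub && python -c \"import os; from huggingface_hub import snapshot_download; snapshot_download(repo_id=os.environ['MODEL_ID'])\""
          ]

          env {
            name  = "MODEL_ID"
            value = var.model_id
          }

          # Point the Hugging Face cache at the mounted volume.
          env {
            name  = "HF_HOME"
            value = "/models"
          }

          # Token read from an existing Secret (name and key assumed).
          env {
            name = "HF_TOKEN"
            value_from {
              secret_key_ref {
                name = "hf-token"
                key  = "token"
              }
            }
          }

          volume_mount {
            name       = "model-cache"
            mount_path = "/models"
          }
        }

        volume {
          name = "model-cache"
          persistent_volume_claim {
            claim_name = "model-cache" # assumed existing PVC
          }
        }
      }
    }
  }

  # Block `terraform apply` until the download finishes.
  wait_for_completion = true

  timeouts {
    create = "60m" # large models can take a while
  }
}
```

If the inference Pods mount the same volume and use the same `HF_HOME`, they can load the weights from the cache instead of downloading them again.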

## Configuration Flexibility: Model and Hardware Customization

The module supports flexible configuration:
- **Model Switching**: Modify the `model_id` variable to switch to any vLLM-compatible model;
- **Hardware Configuration**: The default machine type is `g2-standard-48` (4x L4 GPUs); you can switch to `a3-highgpu-8g` (8x H100 GPUs);
- **Parameter Tuning**: Exposes parameters like `gpu_memory_utilization`, `vllm_dtype`, `max_model_len`, `enable_speculative_decoding`;
- **Zero-Replica Startup**: Set `replicas=0` to complete model download first, then manually scale up to save costs.
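
As an illustration, these options might be set in a `terraform.tfvars` file like the one below. The variable names `model_id`, `gpu_memory_utilization`, `vllm_dtype`, `max_model_len`, `enable_speculative_decoding`, and `replicas` are the ones named above; their types and the values shown are assumptions, and `machine_type` is a hypothetical name for the hardware variable.

```hcl
# terraform.tfvars (illustrative values only)

model_id = "Qwen/Qwen3-32B"

# Hypothetical variable name; check the module for the real one.
machine_type = "g2-standard-48"

gpu_memory_utilization      = 0.90
vllm_dtype                  = "bfloat16"
max_model_len               = 8192
enable_speculative_decoding = false

# Start with no inference Pods so only the model download runs;
# raise this once the weights are cached.
replicas = 0
```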

## Operations Considerations: Monitoring, Scaling, and Cost Control

Key points for day-to-day operations:
- **Auto-Scaling**: GKE node auto-scaling combined with horizontal Pod scaling lets capacity follow load;
- **Cost Control**: GPU nodes are expensive (roughly $1.5/hour for an L4 and $15+/hour for an H100; actual prices vary by region and machine type). Configure budget alerts, and consider Committed Use Discounts for steady On-Demand capacity;
- **Multi-Model Deployment**: Deploy multiple independent model instances in the same cluster via the `name_prefix` variable.
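
A multi-model setup could be sketched by instantiating the module twice with different `name_prefix` values. The module source path and the second model are hypothetical, and the real module may require additional inputs.

```hcl
variable "hf_token" {
  type      = string
  sensitive = true
}

# First instance: resources are prefixed with "qwen32b".
module "vllm_qwen32b" {
  source      = "./modules/vllm" # hypothetical path
  name_prefix = "qwen32b"
  model_id    = "Qwen/Qwen3-32B"
  hf_token    = var.hf_token
}

# Second instance in the same cluster, isolated by its own prefix.
module "vllm_qwen8b" {
  source      = "./modules/vllm" # hypothetical path
  name_prefix = "qwen8b"
  model_id    = "Qwen/Qwen3-8B"
  hf_token    = var.hf_token
}
```

Distinct prefixes keep resource names from colliding, so each model can be scaled or destroyed independently.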

## Conclusion and Insights for Enterprise Deployment

### Conclusion
This project demonstrates a reference approach to deploying enterprise-grade LLM inference services in the cloud. It provides a reproducible, secure, and cost-optimized solution that suits both startups validating an idea and large enterprises scaling AI services.
### Key Insights
1. **IaC First**: Terraform makes deployments auditable and easy to roll back;
2. **Cost Stratification**: Spot + On-Demand balances cost and stability;
3. **Security Built-In**: Token management and network isolation ensure security from the infrastructure layer;
4. **Storage Optimization**: PVC caching avoids repeated downloads of large models;
5. **Modular Design**: Flexible variables support customized deployments.

Recommendation: Start with the default configuration, then gradually adjust hardware, model, and cost strategies, and prioritize monitoring and cost control.
