Zing Forum

Using Terraform to Deploy vLLM on GKE: Enterprise-Grade Large Language Model Inference Infrastructure Practice

This article introduces how to deploy vLLM on Google Kubernetes Engine (GKE) using Terraform, enabling automated model downloading, GPU auto-scaling, and secure Hugging Face Token management, providing enterprises with scalable, high-performance LLM inference infrastructure.

Tags: vLLM, GKE, Terraform, LLM inference, GPU, Kubernetes, Infrastructure as Code, speculative decoding, cost optimization
Published 2026-05-26 06:45 | Recent activity 2026-05-26 06:49 | Estimated read 8 min

Section 01

Introduction: Enterprise-Grade LLM Inference Practice with Terraform-Deployed vLLM on GKE

This article shows how to deploy vLLM inference infrastructure on Google Kubernetes Engine (GKE) using Terraform, with automated model downloading, GPU auto-scaling, and secure Hugging Face token management. The solution uses Infrastructure as Code (IaC) to standardize the deployment process, balances cost (Spot instances) against stability (On-Demand instances), and exposes flexible configuration options. The following sections cover the background, architecture, security practices, deployment steps, and operations considerations.


Section 02

Background: Why Do We Need Terraform-Based LLM Inference Infrastructure?

As large language models move into production, enterprises face the challenge of deploying inference services efficiently, securely, and at scale. vLLM on Kubernetes is a mainstream solution, but manually configuring GPU node pools, storage volumes, and model downloads is cumbersome and error-prone. Terraform, a widely used IaC tool, makes the deployment process standardized, versioned, and reproducible, which addresses these problems.


Section 03

Architecture Design: Dual Node Pools and Cost Optimization Strategy

The core design of this Terraform module is a dual node pool architecture:

  • Spot Node Pool: Uses spare GCP capacity at prices 60-91% lower than standard instances, suitable for batch processing or non-critical tasks (Note: Spot nodes may be reclaimed at any time, so they are not suitable for workloads that require guaranteed availability);
  • On-Demand Node Pool: Serves as a fallback that keeps the service running when Spot capacity is unavailable;
  • Persistent Storage (PVC): Caches Hugging Face models. After the first download, subsequent Pods can mount directly, reducing cold start time.
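
The cost effect of mixing the two pools can be estimated with simple arithmetic. The sketch below is only an illustration: the hourly rate and node counts are hypothetical placeholders, not prices from the module or from GCP, and only the 60-91% discount range comes from the text above.

```python
def blended_hourly_cost(on_demand_rate, spot_discount, spot_nodes, on_demand_nodes):
    """Hourly cost of a mixed pool; spot_discount is a fraction such as 0.60."""
    if not 0.0 <= spot_discount < 1.0:
        raise ValueError("spot_discount must be in [0, 1)")
    spot_rate = on_demand_rate * (1.0 - spot_discount)
    return spot_nodes * spot_rate + on_demand_nodes * on_demand_rate


# Hypothetical inputs: a node listed at 10.0 per hour, 3 Spot nodes, 1 On-Demand fallback.
low_discount = blended_hourly_cost(10.0, 0.60, spot_nodes=3, on_demand_nodes=1)
high_discount = blended_hourly_cost(10.0, 0.91, spot_nodes=3, on_demand_nodes=1)
all_on_demand = blended_hourly_cost(10.0, 0.0, spot_nodes=0, on_demand_nodes=4)

print(f"60% discount: {low_discount:.2f}/h")
print(f"91% discount: {high_discount:.2f}/h")
print(f"On-Demand only: {all_on_demand:.2f}/h")
```

The estimate ignores reclaim events; when Spot nodes are reclaimed and work shifts to the On-Demand pool, the real cost moves toward the On-Demand-only figure.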

Section 04

Security Practices: Token Management and Access Control

Multi-layer strategies are adopted for security:

  1. Environment Variable Injection: The Hugging Face token is supplied through the TF_VAR_hf_token environment variable, which Terraform maps to the hf_token input variable, so the token is never hardcoded in configuration files;
  2. Kubernetes Secrets: The token is stored as a Secret and is accessible only to authorized Pods;
  3. Internal Service Isolation: The vLLM service is exposed as an internal Kubernetes Service by default, without a public IP. Access goes through kubectl port-forward, which reduces the attack surface.
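
To make the first two layers concrete, here is a minimal sketch of what a Secret built from the environment variable might look like. The Secret name `hf-token`, the key `token`, the namespace, and the placeholder fallback value are all assumptions for illustration; the module's actual resource names are not given in this article.

```python
import base64
import os


def build_hf_secret(token, name="hf-token", namespace="default"):
    """Return a Kubernetes Secret manifest (as a dict) holding the token.

    The name, namespace, and data key are hypothetical examples.
    """
    if not token:
        raise ValueError("Hugging Face token is empty; set TF_VAR_hf_token first")
    # Secret 'data' values must be base64-encoded strings.
    encoded = base64.b64encode(token.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {"token": encoded},
    }


# Reuse the same environment variable that Terraform reads; fall back to a
# placeholder so the sketch runs without a real token.
secret = build_hf_secret(os.environ.get("TF_VAR_hf_token", "hf_example_placeholder"))
print(secret["metadata"]["name"], sorted(secret["data"]))
```

Note that base64 is an encoding, not encryption; the protection comes from restricting who can read the Secret.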

Section 05

Deployment Process and Key Features

Deployment Process

Prerequisites: Enable billing and the Kubernetes Engine API (container.googleapis.com) for the GCP project; install the gcloud CLI, Terraform (v1.5+), and kubectl; create a GCS bucket to store the Terraform state files. Core steps: Clone the repository → Set environment variables → Run Terraform initialization and deployment.
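
The core steps could be scripted roughly as follows. This is a sketch under assumptions: the bucket and project names are hypothetical, and passing the state bucket through `-backend-config` assumes the repository uses a partially configured GCS backend, which the article does not confirm. The function only builds the commands and prints them; nothing is executed.

```python
import shlex


def deployment_commands(state_bucket, project_id):
    """Return the CLI steps as argv lists; nothing is executed here."""
    return [
        # container.googleapis.com is the service name of the Kubernetes Engine API.
        ["gcloud", "services", "enable", "container.googleapis.com",
         "--project", project_id],
        # Assumes a partial GCS backend block that takes the bucket at init time.
        ["terraform", "init", f"-backend-config=bucket={state_bucket}"],
        # Save the plan so that apply runs exactly what was reviewed.
        ["terraform", "plan", "-out=tfplan"],
        ["terraform", "apply", "tfplan"],
    ]


# Hypothetical names, for illustration only.
commands = deployment_commands("my-tf-state-bucket", "my-gcp-project")
for argv in commands:
    print(shlex.join(argv))
```

Reviewing the saved plan before applying it is a common safeguard when the change creates GPU nodes that start billing immediately.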

Key Features

  • Model Download Job: An independent Job completes model downloading, avoiding cold start delays for inference Pods;
  • Speculative Decoding: Uses a draft model to quickly generate candidate tokens, which are verified by the main model to improve inference throughput (e.g., Qwen3-32B paired with the Zhihu-ai/Zhi-Create-Qwen3-32B-Eagle3 draft model).
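
The draft-then-verify idea behind speculative decoding can be shown with a toy example. This is a generic greedy variant written for illustration, not the way vLLM or EAGLE-3 implement it: the "models" here are hypothetical functions over integer tokens, and verification is done one position at a time, whereas a real engine checks all drafted positions in a single forward pass of the main model.

```python
def speculative_step(target_next, draft_next, context, k):
    """One draft-then-verify round with greedy acceptance.

    target_next and draft_next map a token list to the next token.
    Returns the tokens accepted in this round (always at least one).
    """
    # The draft model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(context + proposal))

    accepted = []
    for tok in proposal:
        expected = target_next(context + accepted)
        if tok != expected:
            # First mismatch: keep the main model's token and stop this round.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # Every drafted token matched, so the main model adds one more token.
    accepted.append(target_next(context + accepted))
    return accepted


def generate(target_next, draft_next, prompt, max_new_tokens, k=4):
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target_next, draft_next, out, k))
    return out[: len(prompt) + max_new_tokens]


# Toy models: the target counts upward modulo 10; the draft agrees except after a 4.
def target(ctx):
    return (ctx[-1] + 1) % 10


def draft(ctx):
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


first_round = speculative_step(target, draft, [1], k=4)
result = generate(target, draft, [1], max_new_tokens=8)
print(first_round, result)
```

With greedy acceptance the output is identical to what the main model would produce alone; the gain comes from accepting several tokens per verification round whenever the draft guesses correctly.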

Section 06

Configuration Flexibility: Model and Hardware Customization

The module supports flexible configuration:

  • Model Switching: Modify the model_id variable to switch to any vLLM-compatible model;
  • Hardware Configuration: The default machine type is g2-standard-48 (4x L4 GPUs); it can be switched to a3-highgpu-8g (8x H100 GPUs);
  • Parameter Tuning: Exposes parameters like gpu_memory_utilization, vllm_dtype, max_model_len, enable_speculative_decoding;
  • Zero-Replica Startup: Set replicas=0 to complete model download first, then manually scale up to save costs.
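
As an illustration of how such variables might translate into a vLLM server invocation, the sketch below maps a settings object to command-line arguments. The default values, the `Qwen/Qwen3-32B` model identifier, and the `tensor_parallel_size` field are assumptions made here, not values taken from the module; the speculative decoding option is left out because its flag format depends on the vLLM version.

```python
from dataclasses import dataclass


@dataclass
class VllmSettings:
    model_id: str
    gpu_memory_utilization: float = 0.90   # hypothetical default
    vllm_dtype: str = "bfloat16"           # hypothetical default
    max_model_len: int = 8192              # hypothetical default
    tensor_parallel_size: int = 4          # assumption: one shard per L4 GPU


def to_vllm_args(s):
    """Build the argv for `vllm serve` from the settings."""
    if not 0.0 < s.gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    if s.max_model_len <= 0:
        raise ValueError("max_model_len must be positive")
    return [
        "vllm", "serve", s.model_id,
        "--gpu-memory-utilization", str(s.gpu_memory_utilization),
        "--dtype", s.vllm_dtype,
        "--max-model-len", str(s.max_model_len),
        "--tensor-parallel-size", str(s.tensor_parallel_size),
    ]


args = to_vllm_args(VllmSettings(model_id="Qwen/Qwen3-32B"))
print(" ".join(args))
```

Validating values before they reach the container avoids a Pod that schedules onto an expensive GPU node only to crash on a bad argument.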

Section 07

Operations Considerations: Monitoring, Scaling, and Cost Control

Key points for day-to-day operations:

  • Auto-Scaling: GKE node auto-scaling + Pod horizontal scaling to achieve load adaptation;
  • Cost Control: GPU node costs are high (L4 ~$1.5/hour, H100 ~$15+/hour). It is recommended to configure budget alerts and use Committed Use Discounts;
  • Multi-Model Deployment: Deploy multiple independent model instances in the same cluster via the name_prefix variable.
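
For Pod horizontal scaling, the Kubernetes Horizontal Pod Autoscaler computes the desired replica count as ceil(currentReplicas × currentMetric / targetMetric) and skips changes within a small tolerance. The sketch below reproduces that arithmetic; the metric values and replica bounds are hypothetical, and the article does not state which metric or autoscaler settings the module actually uses.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=4, tolerance=0.1):
    """Replica count following the documented HPA scaling formula."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Within tolerance of the target: leave the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical metric: average GPU utilization in percent, with a target of 60.
scale_up = desired_replicas(2, current_metric=90, target_metric=60)
steady = desired_replicas(2, current_metric=62, target_metric=60)
capped = desired_replicas(2, current_metric=300, target_metric=60)
scale_down = desired_replicas(4, current_metric=30, target_metric=60)
print(scale_up, steady, capped, scale_down)
```

Because each new replica may require a new GPU node, the upper bound doubles as a cost ceiling.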

Section 08

Conclusion and Insights for Enterprise Deployment

Conclusion

This project demonstrates a standardized cloud deployment pattern for enterprise-grade LLM inference services. It provides a reproducible, secure, and cost-optimized solution that suits both startups validating an idea and large enterprises scaling AI services.

Key Insights

  1. IaC First: Terraform keeps deployments auditable and makes them possible to roll back;
  2. Cost Stratification: Spot + On-Demand balances cost and stability;
  3. Security Built-In: Token management and network isolation ensure security from the infrastructure layer;
  4. Storage Optimization: PVC caching avoids repeated downloads of large models;
  5. Modular Design: Flexible variables support customized deployments.

Recommendation: Start with the default configuration, gradually adjust hardware, model, and cost strategies, and prioritize monitoring and cost control.