Zing Forum

Deploying a Private LLM from Scratch: A Complete Practical Guide for GPU Cloud Servers

This article details how to use Terraform and GitHub Actions to automatically deploy a complete LLM service stack on AWS: the Ollama inference engine, the Open WebUI chat interface, multi-engine TTS (text-to-speech) synthesis, and a real-time monitoring system.

LLM · Private deployment · GPU · AWS · Terraform · Ollama · TTS · Voice synthesis
Published 2026-06-08 19:42 · Recent activity 2026-06-08 19:52 · Estimated read 5 min

Section 03

Why Do We Need Private LLM Deployment?

As Large Language Model (LLM) technology matures, more developers and enterprises are considering running AI capabilities on their own infrastructure. Private deployment addresses data privacy and compliance concerns, can reduce inference latency, and allows more flexible model customization. However, building a complete LLM service stack from scratch involves several complex steps, such as GPU driver installation, CUDA configuration, containerized deployment, and network configuration, which sets a high barrier for beginners.

The self-hosted-llm-guide project introduced in this article provides a complete automated solution. Using Terraform (Infrastructure as Code) and GitHub Actions workflows, it deploys the full technology stack in one click: LLM inference, a web interface, voice synthesis, and a monitoring system.


Section 04

Overall Technical Architecture

The deployment builds a multi-service AI environment from the following core components:

Section 05

LLM Inference Layer

  • Ollama: Serves as the underlying inference engine, responsible for model loading and text generation
  • Open WebUI: Provides a user-friendly chat interface similar to ChatGPT, supporting multi-model switching and conversation history management
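
As a rough illustration of how a client could talk to the inference layer directly, the sketch below posts a prompt to Ollama's `/api/generate` endpoint using only the Python standard library. The host address and model name in the usage note are placeholders, not values from the project, and the port is the one this guide lists for the Ollama REST API.

```python
import json
import urllib.request

OLLAMA_PORT = 11434  # port this guide lists for the Ollama REST API


def build_generate_request(host: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"http://{host}:{OLLAMA_PORT}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(host: str, model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text from the JSON reply."""
    request = build_generate_request(host, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # With "stream": False, Ollama returns one JSON object whose
    # "response" field holds the full completion.
    return body["response"]
```

A call such as `generate("203.0.113.10", "llama3", "Say hello.")` would work only if that (hypothetical) address is your instance, your IP is allowed by the security group, and the named model has already been pulled.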

Section 06

Voice Synthesis Layer

The project integrates four TTS engines, covering different application scenarios:

Engine     Voices                            GPU requirement  Best scenario
Kokoro     9 presets                         Optional         Fast, low-latency responses
XTTS-v2    21+, voice cloning                Required         Multilingual, emotional expression
Piper      English + Italian                 Not needed       Ultra-lightweight, runs on CPU
VibeVoice  Multi-speaker dialogue synthesis  Required         Long text, podcast style
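
One way to read the table is as a lookup for which engines a given machine can run. The sketch below encodes the rows as data and filters them by GPU availability. The `TtsEngine` class and `usable_engines` helper are illustrative names, not part of the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TtsEngine:
    name: str
    gpu: str  # "required", "optional", or "none", taken from the table
    best_for: str


ENGINES = [
    TtsEngine("Kokoro", "optional", "fast, low-latency responses"),
    TtsEngine("XTTS-v2", "required", "multilingual, emotional expression"),
    TtsEngine("Piper", "none", "ultra-lightweight, runs on CPU"),
    TtsEngine("VibeVoice", "required", "long text, podcast style"),
]


def usable_engines(has_gpu: bool) -> list[str]:
    """Return the names of engines that can run on the given hardware."""
    # An engine is excluded only when it requires a GPU and none is present.
    return [e.name for e in ENGINES if has_gpu or e.gpu != "required"]
```

On a CPU-only instance this leaves Kokoro and Piper; XTTS-v2 and VibeVoice become available once a GPU is present.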

Section 07

Monitoring and Operations

  • Netdata: Real-time system monitoring dashboard that displays GPU utilization, CPU, memory, disk, and network status
  • Automatic Shutdown Scheduling: An EventBridge scheduled task stops the instances every night to save costs
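
The guide does not say what the EventBridge schedule invokes or at what time it fires, so the following is only a sketch of the general pattern: a helper that builds a once-a-day EventBridge cron expression, and a function that stops instances through an EC2 client. It assumes a boto3-style client is passed in (for example `boto3.client("ec2")`); the function names and the hour are hypothetical.

```python
def nightly_cron(hour_utc: int, minute: int = 0) -> str:
    """Build an EventBridge cron expression that fires once per day (UTC)."""
    if not (0 <= hour_utc <= 23 and 0 <= minute <= 59):
        raise ValueError("hour_utc must be 0-23 and minute must be 0-59")
    # EventBridge cron fields: minutes hours day-of-month month day-of-week year
    return f"cron({minute} {hour_utc} * * ? *)"


def stop_instances(ec2_client, instance_ids: list[str]) -> list[str]:
    """Stop the given instances and return the IDs EC2 reports as stopping."""
    if not instance_ids:
        return []  # the EC2 API rejects an empty ID list, so skip the call
    response = ec2_client.stop_instances(InstanceIds=instance_ids)
    return [item["InstanceId"] for item in response["StoppingInstances"]]
```

Passing the client in as an argument keeps the function easy to exercise with a stub, which is useful because a mistaken stop on a GPU instance interrupts every service in the stack.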

Section 08

Network Architecture

The stack is deployed in a dedicated AWS VPC (10.42.0.0/16) with public subnets, an internet gateway, and route tables. Security groups enforce strict inbound access control, allowing only traffic from the user's IP address to reach the following ports:

  • 3000/tcp — Open WebUI chat interface
  • 7860/tcp — Gradio TTS voice synthesis interface
  • 7861/tcp — VibeVoice real-time voice interface
  • 11434/tcp — Ollama REST API interface
  • 19999/tcp — Netdata monitoring dashboard
  • 22/tcp — SSH (optional, only open when a key pair is configured)
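
The port list above can be expressed as a set of inbound rules restricted to a single address. The sketch below builds them in the `IpPermissions` shape that boto3's `authorize_security_group_ingress` accepts. The project itself defines its rules in Terraform, so this is an illustration of the access policy, not the project's code. It assumes an IPv4 address, and the function name is hypothetical.

```python
import ipaddress

VPC_CIDR = ipaddress.ip_network("10.42.0.0/16")  # VPC range named in this guide

SERVICE_PORTS = {
    3000: "Open WebUI",
    7860: "Gradio TTS",
    7861: "VibeVoice",
    11434: "Ollama REST API",
    19999: "Netdata",
}
SSH_PORT = 22


def build_ingress_rules(user_ip: str, ssh_enabled: bool = False) -> list[dict]:
    """Return one TCP ingress rule per service port, limited to user_ip/32."""
    # IPv4Address raises ValueError on malformed input, so bad IPs fail early.
    cidr = f"{ipaddress.IPv4Address(user_ip)}/32"
    ports = dict(SERVICE_PORTS)
    if ssh_enabled:  # SSH is opened only when a key pair is configured
        ports[SSH_PORT] = "SSH"
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": cidr, "Description": label}],
        }
        for port, label in sorted(ports.items())
    ]
```

Using a /32 range means a change of the user's public IP locks them out until the rules are updated, which is the trade-off of this kind of allowlist.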