
Deploying a Private LLM from Scratch: A Complete Practical Guide for GPU Cloud Servers

This article details how to use Terraform and GitHub Actions to automatically deploy a complete LLM service stack on AWS, including the Ollama inference engine, Open WebUI chat interface, multi-engine TTS voice synthesis, and real-time monitoring system.

LLM private deployment · GPU · AWS · Terraform · Ollama · TTS speech synthesis

Published 2026-06-08 19:42 · Recent activity 2026-06-08 19:52 · Estimated read 5 min

Section 01

Introduction / Main Post: Deploying a Private LLM from Scratch: A Complete Practical Guide for GPU Cloud Servers

Section 02

Original Author and Source

Original Author/Maintainer: carlosacchi
Source Platform: GitHub
Original Title: self-hosted-llm-guide
Original Link: https://github.com/carlosacchi/self-hosted-llm-guide
Publication Time: June 2026

Section 03

Why Do We Need Private LLM Deployment?

As large language model (LLM) technology develops rapidly, more developers and enterprises are considering running AI capabilities on their own infrastructure. Private deployment keeps data in-house, which helps with privacy and compliance. It also gives you direct control over inference latency and capacity, and it allows more flexible model customization. However, building a complete LLM service stack from scratch involves several complex steps: installing GPU drivers, configuring CUDA, deploying containers, and setting up networking. Together these are a high barrier for beginners.

The self-hosted-llm-guide project introduced in this article provides a fully automated solution. It uses Terraform for infrastructure as code (IaC) together with GitHub Actions workflows. With a single click, it deploys a complete stack covering LLM inference, a web interface, voice synthesis, and monitoring.

Section 04

Overall Technical Architecture

This deployment solution builds a feature-rich AI service environment, with core components including:

Section 05

LLM Inference Layer

Ollama: Serves as the underlying inference engine, responsible for model loading and text generation
Open WebUI: Provides a user-friendly chat interface similar to ChatGPT, supporting multi-model switching and conversation history management
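Open WebUI talks to Ollama over Ollama's REST API, and you can call that API directly as well. The sketch below uses only the Python standard library to send a non-streaming request to Ollama's `/api/generate` endpoint and read the `response` field.

Two assumptions apply. First, the default `localhost` URL assumes you run the script on the instance itself; from your workstation, pass the instance's public IP instead. Second, the model name `llama3.2` is only an example and must already have been pulled on the server.

```python
import json
import urllib.request

# Assumption: script runs on the instance; port 11434 is Ollama's API port.
OLLAMA_URL = "http://localhost:11434"


def build_generate_request(model, prompt, base_url=OLLAMA_URL):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, base_url=OLLAMA_URL, timeout=120):
    """Send the request and return the generated text from the 'response' field."""
    req = build_generate_request(model, prompt, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]
```

For example, `generate("llama3.2", "Explain what a VPC is in one sentence.")` returns the full completion as a single string. Setting `"stream": False` trades incremental output for a simpler client.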

Section 06

Voice Synthesis Layer

The project integrates four TTS engines, each suited to different use cases:

Engine	Voices / Languages	GPU Requirement	Best Suited For
Kokoro	9 preset voices	Optional	Fast, low-latency responses
XTTS-v2	21+ voices, with voice cloning	Required	Multilingual speech, emotional expression
Piper	English and Italian voices	Not needed	Ultra-lightweight, runs on CPU
VibeVoice	Multi-speaker dialogue synthesis	Required	Long-form text, podcast-style audio
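The table encodes a trade-off between hardware and features. The sketch below transcribes the table into data and applies a hypothetical selection policy. `pick_engine` and its flags are illustrative names only, not part of the project, which exposes the engines through its own interfaces.

```python
# (GPU requirement, best suited for), transcribed from the table above.
ENGINES = {
    "Kokoro": ("optional", "fast, low-latency responses"),
    "XTTS-v2": ("required", "multilingual speech, voice cloning"),
    "Piper": ("none", "ultra-lightweight, runs on CPU"),
    "VibeVoice": ("required", "long-form, multi-speaker audio"),
}


def usable_engines(has_gpu):
    """Engines whose GPU requirement the host satisfies, in table order."""
    return [name for name, (gpu, _) in ENGINES.items()
            if has_gpu or gpu != "required"]


def pick_engine(has_gpu, multi_speaker=False, voice_cloning=False,
                minimal_footprint=False):
    """Hypothetical policy: prefer specialised engines when the GPU allows it."""
    usable = usable_engines(has_gpu)
    if multi_speaker and "VibeVoice" in usable:
        return "VibeVoice"
    if voice_cloning and "XTTS-v2" in usable:
        return "XTTS-v2"
    if minimal_footprint:
        return "Piper"
    # Kokoro works with or without a GPU and targets low latency.
    return "Kokoro"
```

On a CPU-only host, GPU-only requests fall back to Kokoro. This mirrors the "Required" column: XTTS-v2 and VibeVoice are simply unavailable there.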

Section 07

Monitoring and Operations

Netdata: Real-time system monitoring dashboard, displaying GPU utilization, CPU, memory, disk, and network status
Automatic Shutdown Scheduling: an Amazon EventBridge scheduled rule stops the instances every night to save costs
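The article does not say how the nightly stop is wired. One common pattern is an EventBridge rule on a cron schedule, such as `cron(0 22 * * ? *)` for 22:00 UTC daily, that invokes a small Lambda function. The minimal sketch below assumes that pattern. It calls boto3's EC2 `stop_instances` and reads the target IDs from a hypothetical `INSTANCE_IDS` environment variable. The project may instead target EC2 directly without Lambda.

```python
import os


def stop_instances_now(ec2, instance_ids):
    """Stop the given EC2 instances; return the IDs AWS reports as stopping."""
    if not instance_ids:
        return []
    resp = ec2.stop_instances(InstanceIds=list(instance_ids))
    return [item["InstanceId"] for item in resp.get("StoppingInstances", [])]


def lambda_handler(event, context, ec2=None):
    """Entry point for an EventBridge-triggered Lambda (assumed wiring)."""
    # Hypothetical configuration: comma-separated instance IDs.
    ids = [s.strip() for s in os.environ.get("INSTANCE_IDS", "").split(",") if s.strip()]
    if ec2 is None:
        import boto3  # imported lazily so the logic can be tested without AWS
        ec2 = boto3.client("ec2")
    return {"stopped": stop_instances_now(ec2, ids)}
```

Stopping an instance, rather than terminating it, keeps the EBS volume, so pulled models and configuration survive until the next start.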

Section 08

Network Architecture

The stack runs in a dedicated AWS VPC (10.42.0.0/16) with public subnets, an internet gateway, and route tables. Security groups enforce strict inbound access control and accept traffic only from the user's IP, on the following ports:

3000/tcp — Open WebUI chat interface
7860/tcp — Gradio TTS voice synthesis interface
7861/tcp — VibeVoice real-time voice interface
11434/tcp — Ollama REST API interface
19999/tcp — Netdata monitoring dashboard
22/tcp — SSH (optional, only open when a key pair is configured)
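The access policy above is an allowlist: the user's single IP may reach a fixed set of TCP ports, and port 22 opens only when a key pair is configured. The standard-library sketch below models that rule set so you can reason about which connections a security group would admit.

Several names are hypothetical. `ingress_rules`, `is_allowed`, and `public_subnets` are illustrative helpers, and the /24 subnet size is an assumption, since the article does not give subnet sizes.

```python
import ipaddress

# Ports from the list above; SSH (22) is added only when a key pair is set.
SERVICE_PORTS = {
    3000: "Open WebUI",
    7860: "Gradio TTS",
    7861: "VibeVoice",
    11434: "Ollama API",
    19999: "Netdata",
}
VPC_CIDR = ipaddress.ip_network("10.42.0.0/16")


def ingress_rules(admin_ip, key_pair_configured=False):
    """Return (source network, port) pairs for a 'your IP only' policy."""
    source = ipaddress.ip_network(admin_ip)  # a bare IPv4 address becomes a /32
    ports = set(SERVICE_PORTS)
    if key_pair_configured:
        ports.add(22)
    return [(source, port) for port in sorted(ports)]


def is_allowed(rules, src_ip, port):
    """True if a TCP connection from src_ip to port matches any rule."""
    addr = ipaddress.ip_address(src_ip)
    return any(addr in network and port == p for network, p in rules)


def public_subnets(count, new_prefix=24):
    """Carve the first `count` subnets of the given size from the VPC CIDR."""
    subnets = VPC_CIDR.subnets(new_prefix=new_prefix)
    return [next(subnets) for _ in range(count)]
```

Any address other than the admin IP is rejected on every port. In the default configuration, the admin IP is also rejected on port 22.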
