
Miramar Platform: Engineering Practice and Architecture Design of a Hybrid-Cloud AI Platform

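This article introduces the Miramar Platform project, a hybrid AI platform that combines local DGX workstations with GCP cloud resources. The project shows how to build reproducible MLOps workflows with Terraform, GKE, Workload Identity Federation, and self-hosted GPU runners.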

Tags: hybrid cloud, AI platform, DGX Spark, GCP, GKE, MLOps, Terraform, Workload Identity Federation, GitHub Actions, self-hosted runner, Kubeflow

Published 2026/06/09 00:16 | Last activity 2026/06/09 00:20 | Estimated reading time 7 minutes

Section 01

Miramar Platform: Hybrid Cloud AI Platform Core Overview

Miramar Platform is a hybrid AI platform developed by miramar-labs-org (source: GitHub repo https://github.com/miramar-labs-org/miramar-platform-gcp, updated 2026-06-08). It integrates local NVIDIA DGX Spark and Jetson AGX Orin devices with Google Cloud Platform (GCP) resources to address the cost, privacy, and scalability trade-offs of AI infrastructure. The platform builds reproducible AI pipelines on Terraform, GKE, Workload Identity Federation, self-hosted GPU runners, and MLOps workflows.

Section 02

Project Background & Core Vision

AI teams face two main infrastructure issues: full cloud dependency leads to high GPU costs and data privacy risks; full local deployment lacks elastic scalability. Miramar Platform's core vision is to combine local and cloud resources: sensitive data/model training is done locally, while inference and collaboration are handled on the cloud. This approach balances data privacy and cloud convenience.

Section 03

Local Hardware Architecture

The platform uses three heterogeneous local machines:

WSL2 Workstation: Ubuntu 22.04 on a Windows laptop, RTX 4060 (8GB), acts as a GitHub Actions self-hosted runner for lightweight CI/CD tasks.
Jetson AGX Orin: 64GB unified memory, 2048 CUDA cores, Ubuntu 22.04 with JetPack 6.x (arm64), suitable for edge AI inference and lightweight training.
DGX Spark: 128GB unified memory, GB10 Superchip (6144 CUDA cores, 192 Tensor Cores), handles large model fine-tuning and complex training.

All machines use the same mlabs-runner Docker image (multi-arch: amd64/arm64) to simplify operations.
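As a rough illustration of why one multi-arch image is enough, the sketch below maps each runner label to the image platform it would pull. The label-to-platform table is an assumption: the article states only that the Jetson is arm64 and that the image covers amd64/arm64; the DGX Spark entry assumes the GB10's Arm CPU, and the helper names are hypothetical.

```python
# Hypothetical table: which image platform each runner label would pull.
RUNNER_PLATFORMS = {
    "wsl2": "linux/amd64",  # x86-64 laptop running Ubuntu under WSL2
    "agx": "linux/arm64",   # Jetson AGX Orin (arm64 per the article)
    "dgx": "linux/arm64",   # DGX Spark, assumed arm64 (GB10 pairs an Arm CPU with the GPU)
}

# Common `uname -m` values and the Docker platform string they correspond to.
_MACHINE_TO_PLATFORM = {
    "x86_64": "linux/amd64",
    "amd64": "linux/amd64",
    "aarch64": "linux/arm64",
    "arm64": "linux/arm64",
}


def docker_platform(machine: str) -> str:
    """Translate a `uname -m` style string into a Docker platform string."""
    try:
        return _MACHINE_TO_PLATFORM[machine.lower()]
    except KeyError:
        raise ValueError(f"unsupported architecture: {machine}") from None


def required_platforms(runner_labels):
    """Return the sorted set of platforms an image must cover for these runners."""
    return sorted({RUNNER_PLATFORMS[label] for label in runner_labels})


if __name__ == "__main__":
    # The three runners together need exactly the two platforms the image ships.
    print(required_platforms(["wsl2", "agx", "dgx"]))
```

Because all three labels resolve to one of two platforms, a single image tag published for both platforms can serve every machine.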

Section 04

Local AI Software Stack

DGX Spark runs a complete local AI software stack:

Minikube: Local Kubernetes for container orchestration.
NeMo Microservices: NVIDIA's framework for large model training/fine-tuning/inference.
MLflow & MinIO: Experiment tracking/model version management with S3-compatible storage.
Qdrant: Vector database for RAG semantic search.
Kubeflow Pipelines: Orchestration for complex ML workflows.
Ollama & NIM: Local inference services (Ollama for consumer models, NIM for enterprise NVIDIA-optimized models).

These services are exposed via SSH tunnels to development workstations for a cloud-like local experience.
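The SSH tunnel setup mentioned above could look roughly like the following sketch, which assembles an `ssh` command with one local port forward per service. The host name `dgx-spark` is hypothetical, and the ports are each tool's upstream default rather than values confirmed by the project; services running inside Minikube may need an extra port-forward step that is not shown here.

```python
import shlex

# Assumed ports: upstream defaults for each tool, not taken from the project.
DEFAULT_PORTS = {
    "mlflow": 5000,
    "minio": 9000,
    "qdrant": 6333,
    "ollama": 11434,
}


def tunnel_command(host, services, user=None):
    """Build an ssh argv that forwards each service's port to localhost."""
    target = f"{user}@{host}" if user else host
    argv = ["ssh", "-N"]  # -N: forward ports only, run no remote command
    for name in services:
        port = DEFAULT_PORTS[name]
        # -L local_port:remote_host:remote_port
        argv += ["-L", f"{port}:localhost:{port}"]
    argv.append(target)
    return argv


if __name__ == "__main__":
    # "dgx-spark" is a placeholder host name.
    print(shlex.join(tunnel_command("dgx-spark", ["mlflow", "qdrant"])))
```

With a tunnel like this open, a workstation could reach the tracking server at `http://localhost:5000` as if it were running locally.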

Section 05

Cloud Architecture (GCP)

The cloud part uses Terraform for infrastructure-as-code (IaC) to ensure reproducibility:

GKE Standard Cluster (miramar-shared-gke): Shared Kubernetes layer for various workloads.
Artifact Registry (apps repo): Stores application images.
GCS Buckets: Persist Terraform state and GKE node pool snapshots.
Workload Identity Federation: Enables keyless authentication from GitHub Actions to GCP, improving security by removing the need for long-lived service account keys.
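To make the keyless setup more concrete, the sketch below builds the two identifiers such a configuration typically revolves around: the provider resource name that a GitHub Actions workflow points at, and the `principalSet` member that restricts access to a single repository. The project number, pool ID, and provider ID are placeholders, and the repository-scoped member assumes the provider maps `attribute.repository` from the GitHub OIDC token; none of these values come from the project itself.

```python
def provider_resource_name(project_number, pool_id, provider_id):
    """Full resource name of a workload identity pool provider."""
    return (
        f"projects/{project_number}/locations/global/"
        f"workloadIdentityPools/{pool_id}/providers/{provider_id}"
    )


def repository_principal_set(project_number, pool_id, repository):
    """IAM member matching every identity from one GitHub repository.

    Assumes the provider's attribute mapping defines `attribute.repository`.
    """
    if repository.count("/") != 1:
        raise ValueError("repository must look like 'owner/name'")
    return (
        "principalSet://iam.googleapis.com/"
        f"projects/{project_number}/locations/global/"
        f"workloadIdentityPools/{pool_id}/attribute.repository/{repository}"
    )


if __name__ == "__main__":
    # Placeholder identifiers; only the repository name appears in the article.
    print(provider_resource_name("123456789012", "github-pool", "github-provider"))
    print(repository_principal_set(
        "123456789012", "github-pool", "miramar-labs-org/miramar-platform-gcp"
    ))
```

Scoping the IAM binding to one repository means a token issued to any other repository cannot impersonate the service account, even though no key is ever stored in GitHub.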

Section 06

CI/CD & Project Factory

CI/CD is powered by GitHub Actions:

Self-hosted runners: All three local machines are registered as runners (labeled wsl2, dgx, agx) so that tasks needing a GPU, local access, or arm64 are routed to the appropriate machine.
Workflow matrix: Covers the platform lifecycle (create/destroy/expand/recover), GPU quota management, local AI service deployment, runner image builds, WSL2 configuration, and more. Each workflow has a corresponding destroy/uninstall workflow.

Project Factory: Projects created from a template automatically get a notebook-first development environment (JupyterLab), preconfigured CI/CD, platform integration, local execution configuration, and standard documentation. The first template is a Kubeflow Pipelines fine-tuning project: the model is fine-tuned locally on PHI data, and the de-identified model is then deployed to GCP for inference.
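The label-based routing described in this section can be sketched as a small decision function. The routing rules and the `Task` fields are assumptions made for illustration, loosely following the hardware roles given earlier (DGX Spark for heavy training, Jetson for arm64 and edge work, WSL2 for lightweight CI/CD); the project's real workflows may choose runners differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical description of what a CI job needs."""
    heavy_training: bool = False  # large fine-tuning or complex training
    needs_arm64: bool = False     # must run on an arm64 edge device


def pick_runner(task: Task) -> str:
    """Choose one of the three runner labels named in the article."""
    if task.heavy_training:
        return "dgx"
    if task.needs_arm64:
        return "agx"
    return "wsl2"  # default: lightweight CI/CD


def runs_on(task: Task) -> list:
    """Value for a job's `runs-on` key: the self-hosted label plus the machine label."""
    return ["self-hosted", pick_runner(task)]


if __name__ == "__main__":
    print(runs_on(Task(heavy_training=True)))
    print(runs_on(Task(needs_arm64=True)))
    print(runs_on(Task()))
```

Keeping the rule in one place makes it easy to see which machine a new workflow would land on before it is ever queued.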

Section 07

Engineering Highlights & Applicable Scenarios

Key engineering practices:

Keyless authentication (Workload Identity Federation) reduces credential risks.
Unified multi-arch container image simplifies CI/CD and operations.
Full lifecycle management avoids orphaned cloud resources.
Docs-as-code ensures knowledge transfer.

Applicable scenarios:

AI teams handling sensitive data (medical/finance).
Organizations wanting to reduce cloud GPU costs.
Projects needing edge AI capabilities.
Teams adopting IaC practices.

Conclusion: The hybrid model balances data sovereignty and cloud elasticity, and could become a standard pattern for future AI infrastructure.