Zing Forum

Miramar Platform: Engineering Practice and Architecture Design of a Hybrid-Cloud AI Platform

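This article introduces the Miramar Platform project, a hybrid AI platform that combines local DGX workstations with GCP cloud resources. The project shows how to build reproducible MLOps workflows using Terraform, GKE, Workload Identity Federation, and self-hosted GPU runners.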

Tags: Hybrid Cloud, AI Platform, DGX Spark, GCP, GKE, MLOps, Terraform, Workload Identity Federation, GitHub Actions, Self-Hosted Runner, Kubeflow
Published: 2026/06/09 00:16 | Last activity: 2026/06/09 00:20 | Estimated reading time: 7 minutes

Section 01

Miramar Platform: Hybrid Cloud AI Platform Core Overview

Miramar Platform is a hybrid AI platform developed by miramar-labs-org (source: GitHub repo https://github.com/miramar-labs-org/miramar-platform-gcp, updated 2026-06-08). It integrates local NVIDIA DGX Spark and Jetson AGX Orin devices with Google Cloud Platform (GCP) resources, so that teams do not have to choose between the cost and privacy risks of an all-cloud setup and the limited scalability of an all-local one. Its key technologies are Terraform, GKE, Workload Identity Federation, self-hosted GPU runners, and MLOps workflows, with the goal of building reproducible AI pipelines.

Section 02

Project Background & Core Vision

AI teams face two main infrastructure problems: relying entirely on the cloud leads to high GPU costs and data privacy risks, while deploying entirely on local hardware gives up elastic scalability. Miramar Platform's core vision is to combine the two: sensitive data and model training stay on local machines, while inference and collaboration run in the cloud. This keeps private data on premises without giving up the convenience of cloud services.

Section 03

Local Hardware Architecture

The platform uses three heterogeneous local machines:

  1. WSL2 Workstation: Ubuntu 22.04 on a Windows laptop with an RTX 4060 (8GB); acts as a GitHub Actions self-hosted runner for lightweight CI/CD tasks.
  2. Jetson AGX Orin: 64GB unified memory, 2048 CUDA cores, Ubuntu 22.04 with JetPack 6.x (arm64); suited to edge AI inference and lightweight training.
  3. DGX Spark: 128GB unified memory, GB10 Superchip (6144 CUDA cores, 192 Tensor Cores); handles large-model fine-tuning and complex training.

All machines use the same mlabs-runner Docker image (multi-arch: amd64/arm64) to simplify operations.
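
As a rough illustration of why a single multi-arch image simplifies operations, the sketch below maps a machine's CPU architecture to the Docker platform string that a multi-arch image would resolve to. The function name `docker_platform` and the lookup table are hypothetical and are not taken from the Miramar repository; only the amd64/arm64 split comes from the text above.

```python
import platform

# Architecture names as reported by platform.machine(), mapped to Docker
# platform identifiers. This table is an illustrative assumption, not
# something taken from the mlabs-runner image definition.
_ARCH_TO_DOCKER_PLATFORM = {
    "x86_64": "linux/amd64",
    "amd64": "linux/amd64",
    "aarch64": "linux/arm64",
    "arm64": "linux/arm64",
}


def docker_platform(machine=None):
    """Return the Docker platform string for a CPU architecture.

    If `machine` is omitted, the architecture of the current host is used.
    """
    name = (machine or platform.machine()).lower()
    if name not in _ARCH_TO_DOCKER_PLATFORM:
        raise ValueError(f"unsupported architecture: {name!r}")
    return _ARCH_TO_DOCKER_PLATFORM[name]


if __name__ == "__main__":
    # An x86_64 laptop and an arm64 Jetson pull the same image tag but
    # resolve to different platform variants.
    print(docker_platform("x86_64"))
    print(docker_platform("aarch64"))
```

With one image tag covering both platforms, workflow definitions do not need a per-machine image name; the container runtime on each runner selects the matching variant.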

Section 04

Local AI Software Stack

DGX Spark runs a complete local AI software stack:

  • Minikube: Local Kubernetes for container orchestration.
  • NeMo Microservices: NVIDIA's framework for large model training/fine-tuning/inference.
  • MLflow & MinIO: Experiment tracking/model version management with S3-compatible storage.
  • Qdrant: Vector database for RAG semantic search.
  • Kubeflow Pipelines: Orchestration for complex ML workflows.
  • Ollama & NIM: Local inference services (Ollama for consumer models, NIM for enterprise NVIDIA-optimized models).

These services are exposed via SSH tunnels to development workstations for a cloud-like local experience.
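
The SSH tunnelling described above can be sketched as a small helper that builds an `ssh -L` port-forwarding command. This is an assumption-laden sketch, not the project's actual tooling: the host name `dgx-spark.local` is hypothetical, the ports are the upstream defaults of each service (MLflow 5000, MinIO 9000, Qdrant 6333, Ollama 11434), and it assumes each service is already reachable on the DGX host's loopback interface, which for services inside Minikube would need an extra forwarding step.

```python
import shlex

# Upstream default ports; the real deployment may use different ones.
SERVICE_PORTS = {
    "mlflow": 5000,
    "minio": 9000,
    "qdrant": 6333,
    "ollama": 11434,
}


def ssh_tunnel_command(host, services, user=None):
    """Build an ssh argv that forwards each service port to localhost.

    -N tells ssh not to run a remote command, so the session exists only
    to carry the forwarded ports.
    """
    target = f"{user}@{host}" if user else host
    argv = ["ssh", "-N"]
    for name in services:
        port = SERVICE_PORTS[name]  # KeyError for an unknown service name
        argv += ["-L", f"{port}:localhost:{port}"]
    argv.append(target)
    return argv


if __name__ == "__main__":
    # "dgx-spark.local" is a placeholder host name.
    cmd = ssh_tunnel_command("dgx-spark.local", ["mlflow", "qdrant"])
    print(shlex.join(cmd))
```

Returning an argument list rather than a shell string lets the caller pass it straight to `subprocess.run` without quoting concerns.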

Section 05

Cloud Architecture (GCP)

The cloud part uses Terraform for infrastructure-as-code (IaC) to ensure reproducibility:

  • GKE Standard Cluster (miramar-shared-gke): Shared Kubernetes layer for various workloads.
  • Artifact Registry (apps repo): Stores application images.
  • GCS Buckets: Persist Terraform state and GKE node pool snapshots.
  • Workload Identity Federation: Enables keyless authentication from GitHub Actions to GCP, which improves security by removing the need for long-lived service account keys.
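
To make the keyless setup more concrete, the sketch below builds the two resource strings that a GitHub-to-GCP Workload Identity Federation configuration typically revolves around: the full provider name that a workflow passes to the authentication step, and the `principalSet://` member used in an IAM binding to restrict access to one repository. The project number, pool ID, and provider ID are placeholders, and the binding assumes the provider maps `attribute.repository` from the GitHub token's `repository` claim; none of these values are taken from the Miramar Terraform code.

```python
def wif_provider_name(project_number, pool_id, provider_id):
    """Full resource name of a workload identity pool provider."""
    return (
        f"projects/{project_number}/locations/global/"
        f"workloadIdentityPools/{pool_id}/providers/{provider_id}"
    )


def repo_principal_set(project_number, pool_id, repository):
    """IAM member matching every identity from one GitHub repository.

    Assumes the provider's attribute mapping defines
    attribute.repository = assertion.repository.
    """
    return (
        "principalSet://iam.googleapis.com/"
        f"projects/{project_number}/locations/global/"
        f"workloadIdentityPools/{pool_id}/"
        f"attribute.repository/{repository}"
    )


if __name__ == "__main__":
    # Placeholder identifiers, for illustration only.
    number, pool, provider = "123456789012", "github-pool", "github-provider"
    print(wif_provider_name(number, pool, provider))
    print(repo_principal_set(number, pool, "miramar-labs-org/miramar-platform-gcp"))
```

Scoping the binding to a single repository means a token issued to a workflow in any other repository cannot impersonate the service account, even though it is signed by the same GitHub OIDC issuer.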

Section 06

CI/CD & Project Factory

CI/CD is powered by GitHub Actions:

  • Self-hosted runners: All three local machines are registered as runners (labeled wsl2, dgx, agx) so that jobs needing a GPU, local access, or arm64 are routed to the appropriate machine.
  • Workflow matrix: Covers the platform lifecycle (create/destroy/expand/recover), GPU quota management, local AI service deployment, runner image building, WSL2 configuration, and more. Each workflow has a corresponding destroy/uninstall workflow.

Project Factory: Projects created from a template automatically get a notebook-first development environment (JupyterLab), preconfigured CI/CD, platform integration, local execution configuration, and standard documentation. The first template is a Kubeflow Pipelines fine-tuning project: fine-tuning on PHI data happens locally, and the desensitized model is then deployed to GCP for inference.
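
The label-based routing of jobs to the three runners can be sketched as a small selection function. The labels wsl2, dgx, and agx come from the text above, and `self-hosted` is the label GitHub Actions assigns to every self-hosted runner; the `JobNeeds` type and the selection rules are hypothetical and only illustrate one plausible policy, not the one the project uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobNeeds:
    """Hypothetical description of what a CI job requires."""
    large_gpu: bool = False   # large-model fine-tuning or heavy training
    arm64_edge: bool = False  # edge inference or arm64-specific builds


def runs_on(needs):
    """Return the label list a workflow job would put under `runs-on`."""
    if needs.large_gpu:
        label = "dgx"
    elif needs.arm64_edge:
        label = "agx"
    else:
        label = "wsl2"  # default: lightweight CI/CD tasks
    return ["self-hosted", label]


if __name__ == "__main__":
    print(runs_on(JobNeeds()))
    print(runs_on(JobNeeds(large_gpu=True)))
    print(runs_on(JobNeeds(arm64_edge=True)))
```

In this sketch a job that needs both a large GPU and arm64 goes to the DGX Spark, since the GPU requirement is checked first; a real policy might order these differently.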

Section 07

Engineering Highlights & Applicable Scenarios

Key engineering practices:

  1. Keyless authentication (Workload Identity Federation) reduces credential risks.
  2. Unified multi-arch container image simplifies CI/CD and operations.
  3. Full lifecycle management avoids orphaned cloud resources.
  4. Docs-as-code ensures knowledge transfer.

Applicable scenarios:

  • AI teams handling sensitive data (medical/finance).
  • Organizations wanting to reduce cloud GPU costs.
  • Projects needing edge AI capabilities.
  • Teams adopting IaC practices.

Conclusion: The hybrid model balances data sovereignty and cloud elasticity, and may become a standard pattern for future AI infrastructure.