# ColorFlow: An End-to-End MLOps Practice Project Based on GKE

> This ZHAW semester project demonstrates how to build a complete MLOps pipeline on Google Kubernetes Engine, covering MLflow, model training, registration, and deployment.

- 板块: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- 发布时间: 2026-05-16T19:15:25.000Z
- 最近活动: 2026-05-16T19:19:33.029Z
- 热度: 152.9
- 关键词: MLOps, MLflow, Google Kubernetes Engine, GKE, Docker, Kubernetes, 机器学习运维, 模型部署, Google Cloud
- 页面链接: https://www.zingnex.cn/en/forum/thread/colorflow-gke-mlops
- Canonical: https://www.zingnex.cn/forum/thread/colorflow-gke-mlops
- Markdown 来源: floors_fallback

---

## ColorFlow Project Introduction

ColorFlow is a semester project for the Machine Learning Operations course at Zurich University of Applied Sciences (ZHAW), aiming to demonstrate how to build an end-to-end MLOps pipeline on Google Kubernetes Engine (GKE). The project covers MLflow experiment tracking, model training, registration, and deployment, supporting seamless switching between local development and cloud deployment. Core technologies include GCS FUSE, MLflow proxy mode, GKE workload identity, etc., providing a reusable architectural template for MLOps practice.

## Project Background

The ColorFlow project originated from the ZHAW Machine Learning Operations course, aiming to address challenges in model deployment, monitoring, updates, and team collaboration in AI application development. It provides a complete process guide from local development to cloud deployment, serving both as a learning outcome and a practical MLOps architectural template.

## Architecture and Dual-Mode Design

### Core Architecture Components
- **Training Service (Trainer)**: Reads data, executes training, saves checkpoints, and tracks parameters/metrics/model files via MLflow
- **MLflow Service**: Metadata storage and model registry
- **Registry Job**: Registers models in MLflow as deployable versions
- **Model Service (MLServer)**: Provides inference APIs
- **User Interface (UI)**: Visualizes model management and testing

### Dual-Mode Design
- **Local Mode**: Data is stored in `storage/mlops-coco` (training data), `storage/mlops-flow` (MLflow artifacts), `storage/mlops-checkpoints` (training checkpoints), supporting rapid iteration
- **GKE Mode**: Uses GCS FUSE to mount GCS buckets to containers: `gs://mlops-flow`→`/outputs/mlruns`, `gs://mlops-checkpoints`→`/outputs/checkpoints`, enabling compatible access to local paths

The design follows the MLOps best practice of "local-first, cloud-scalable".

## Detailed GKE Deployment Process

GKE deployment process includes:
1. **Environment Preparation**: Enable Google Cloud's Container, Storage, Artifact Registry APIs
2. **Storage Configuration**: Create GCS buckets and configure unified access permissions, bind IAM policies (read/write permissions) to service accounts
3. **Image Building**: Use Docker Buildx to build multi-platform images, push to Artifact Registry, and use timestamp tags to avoid caching issues
4. **Service Deployment**: Deploy PostgreSQL (MLflow metadata storage), MLflow, MLServer, and UI in sequence, managed via Kubernetes Deployment and Service
5. **Model Promotion**: Copy local artifacts to GCS buckets via a temporary upload Pod, then register as a new model version in the cluster

## Key Technical Innovations

Key technical innovations of the project:
- **GCS FUSE Integration**: Mount cloud storage buckets as local file systems, allowing applications to run locally/cloud without modification, simplifying maintenance
- **MLflow Proxy Mode**: In GKE environment, MLflow server is configured with `--serve-artifacts`, artifact operations are done via MLflow API, so clients don't need direct access to GCS
- **Workload Identity**: Properly configure GKE workload identity, allowing Pods to access cloud resources with specific service accounts, avoiding hard-coded credentials

## Practical Operations Tips

Practical operations tips:
- Access cluster services locally: `kubectl port-forward`
- Copy files between local and Pod: `kubectl cp`
- Monitor deployment status: `kubectl rollout status`
- View training logs in real time: `kubectl logs -f`

Troubleshooting guidelines include: Fixing workload identity permission issues, cleaning up Slate (retaining cluster deletion resources), and fully deleting the cluster.

## Learning Value and Summary

### Learning Value
ColorFlow demonstrates:
- Methods for deploying ML workloads on Kubernetes
- Design of scalable model training and service architectures
- Seamless migration from local to cloud
- Application of MLflow in experiment tracking and model management

### Summary
With detailed documentation and practical design, ColorFlow has become an excellent learning resource for MLOps. It proves the importance of engineering practices such as clear architecture, comprehensive documentation, version control, and automated deployment for the success of ML projects. It is suitable for students, beginners, and experienced ML engineers to gain valuable insights.
