Zing Forum

SkyPhusion LLM: A Multimodal AI Playground Built on a Single Cloudflare Worker

A full-featured multimodal AI playground deployed on a single Cloudflare Worker, supporting 35 chat models, voice conversations, image/video/music generation, RAG (Retrieval-Augmented Generation), and project knowledge base management.

Cloudflare, AI, Multimodal, Worker, RAG, Voice Chat, Image Generation, Video Generation, Open Source
Published 2026-06-13 11:42 | Recent activity 2026-06-13 11:51 | Estimated read 7 min

Section 01

SkyPhusion LLM: A Full-Featured Multimodal AI Playground on a Single Cloudflare Worker

SkyPhusion LLM is a full-featured multimodal AI playground deployed entirely on a single Cloudflare Worker. It integrates 35 chat models from 5 providers and supports hands-free voice chat, image/video/music generation, RAG retrieval, and project knowledge base management. The project, by skyphusion-labs and hosted on GitHub, shows how far Cloudflare's stack can go: a rich AI app with no separate server architecture, written in TypeScript with no extra frameworks.

Section 02

Background & Project Overview

Project Overview

SkyPhusion LLM runs as a single Cloudflare Worker. It supports 35 chat models from 5 providers, plus voice dialogue, image/video/music generation, text-to-speech (TTS), speech-to-text (STT), RAG, and project knowledge base (KB) management. Its core value is as a showcase of Cloudflare's platform: deployment is simple, there are no servers to manage, and the code is plain TypeScript with no extra frameworks.

Section 03

Core Technical Architecture

Unified AI Call Interface

All model calls go through the env.AI.run() binding, which supports:

  • Chat (35 models across 5 providers)
  • Visual input (image understanding)
  • Image/video/music generation
  • TTS (Aura-2, MeloTTS)
  • STT (Whisper, Deepgram Nova-3)
  • Streaming voice chat (Deepgram Flux)
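
To make the single entry point concrete, here is a minimal sketch of a chat call routed through such a binding. It is not taken from the project: `AiLike` is a hypothetical structural stand-in for the real binding type, the `chat` and `extractText` helpers are invented for illustration, and the `{ response: string }` result shape is an assumption that holds for typical text-generation models but not for image, audio, or streaming calls.

```typescript
// Hypothetical stand-in for the Workers AI binding exposed as env.AI.
// In a real Worker the type would come from Cloudflare's generated types.
interface AiLike {
  run(model: string, inputs: Record<string, unknown>): Promise<unknown>;
}

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Pulls the reply text out of a result, assuming a { response: string } shape.
// Other modalities (images, audio) return different shapes and are rejected here.
function extractText(result: unknown): string {
  if (typeof result === "object" && result !== null && "response" in result) {
    const text = (result as { response: unknown }).response;
    if (typeof text === "string") return text;
  }
  throw new Error("Unexpected response shape");
}

// Sends one chat request through the binding and returns the reply text.
async function chat(
  ai: AiLike,
  model: string,
  messages: ChatMessage[],
): Promise<string> {
  const result = await ai.run(model, { messages });
  return extractText(result);
}
```

Inside a Worker's fetch handler this would be called as `chat(env.AI, modelId, messages)`, where `modelId` is whichever model the user picked. Keeping response parsing in its own function is one way to let the other modalities add their own parsers without touching the call path.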

Multi-Provider Support

  1. Workers AI: Llama 4 Scout, Llama 3.x, Qwen3 30B, etc.
  2. Anthropic: Claude Opus 4.8/4.7, Sonnet 4.6, Haiku 4.5
  3. xAI: Grok 4.3, Grok 4.20, Grok Build 0.1
  4. OpenAI: GPT 5.5/5.4/5.4 mini, o4-mini
  5. Google Gemini: Gemini 3.1 Pro
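
One way to keep 35 models from 5 providers manageable is a single catalog that records each model's provider. The sketch below is an illustration only: the `id` strings are hypothetical identifiers, not the project's real model ids, and the catalog lists just one model per provider.

```typescript
type Provider = "workers-ai" | "anthropic" | "xai" | "openai" | "google";

interface ModelEntry {
  id: string; // hypothetical identifier, invented for this sketch
  label: string; // display name shown in a model selector
  provider: Provider;
}

// One example entry per provider named in the list above.
const CATALOG: ModelEntry[] = [
  { id: "workers-ai/llama-4-scout", label: "Llama 4 Scout", provider: "workers-ai" },
  { id: "anthropic/claude-haiku-4.5", label: "Claude Haiku 4.5", provider: "anthropic" },
  { id: "xai/grok-4.3", label: "Grok 4.3", provider: "xai" },
  { id: "openai/o4-mini", label: "o4-mini", provider: "openai" },
  { id: "google/gemini-3.1-pro", label: "Gemini 3.1 Pro", provider: "google" },
];

// Looks up a model by id and fails loudly on unknown ids.
function findModel(id: string): ModelEntry {
  const entry = CATALOG.find((m) => m.id === id);
  if (!entry) throw new Error(`Unknown model: ${id}`);
  return entry;
}

// Case-insensitive substring search over labels and ids.
function searchModels(query: string): ModelEntry[] {
  const q = query.trim().toLowerCase();
  return CATALOG.filter(
    (m) => m.label.toLowerCase().includes(q) || m.id.toLowerCase().includes(q),
  );
}
```

With a catalog like this, provider-specific request formatting can hang off the `provider` field, and the same data can back a searchable selector.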

Infrastructure Components

  • D1: Chat metadata, dialogue history, RAG text blocks
  • R2: Binary files (images, audio, video)
  • Vectorize: RAG embeddings (768-dimensional, BGE-base)
  • AI Gateway: Observability, caching, rate limits
  • Workflows: Long tasks (video/music generation)
  • Access: User email-based access control

Section 04

Key Functional Details

Hands-Free Voice Chat

  • Real-time transcription via Deepgram Flux
  • Model responses spoken aloud via Aura-2 TTS
  • Works with all 35 chat models; history is saved the same way as for text chats

RAG Features

  • Upload any file (v0.23+) or zip batches (v0.25+)
  • PDFs are extracted page by page and spreadsheets sheet by sheet; all other files are read as UTF-8 text
  • Documents are chunked, embedded with BGE-base, and stored in Vectorize and D1
  • When RAG is enabled, the top 5 relevant blocks are injected into the system prompt
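
The retrieval flow above (chunk, embed, rank, inject) can be sketched in a few pure functions. This is an illustration under stated assumptions, not the project's code: the chunk size and overlap parameters, the prompt wording, and all function names are hypothetical, and the ranking is done in memory with cosine similarity, whereas the project delegates similarity search to Vectorize. Embeddings are taken as input, so any dimension works, including the 768 used by BGE-base.

```typescript
interface Chunk {
  id: string;
  text: string;
  embedding: number[];
}

// Splits text into fixed-size character windows that overlap by `overlap`.
function chunkText(text: string, size: number, overlap: number): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("Require size > 0 and 0 <= overlap < size");
  }
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last window reached the end
  }
  return chunks;
}

// Cosine similarity; returns 0 when either vector has zero length.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Returns the k chunks most similar to the query embedding (default 5).
function topK(query: number[], chunks: Chunk[], k = 5): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosine(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}

// Appends the retrieved blocks to the base system prompt.
function buildSystemPrompt(base: string, blocks: Chunk[]): string {
  if (blocks.length === 0) return base;
  const context = blocks.map((b, i) => `[${i + 1}] ${b.text}`).join("\n");
  return `${base}\n\nRelevant context:\n${context}`;
}
```

For example, `chunkText("abcdefghij", 4, 1)` yields the windows `"abcd"`, `"defg"`, and `"ghij"`, each sharing one character with its neighbor so that sentences cut at a boundary still appear whole in at least one chunk.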

Project & KB Management

  • Group docs/conversations into projects (v0.20+)
  • Per-project system prompt and retrieval scope
  • Docs can belong to multiple projects, and conversations can be moved between projects
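
The project rules above imply a many-to-many link between docs and projects and a single project per conversation. A minimal data-model sketch follows; every type and function name is hypothetical, and the rule that a non-empty project prompt replaces the global one is an assumption, since the source does not say how the two combine.

```typescript
interface Project {
  id: string;
  name: string;
  systemPrompt: string;
}

interface Doc {
  id: string;
  title: string;
  projectIds: string[]; // a doc may belong to several projects
}

interface Conversation {
  id: string;
  projectId: string | null; // null means the conversation is ungrouped
}

// Retrieval scope: only docs attached to the given project are searchable.
function docsInScope(docs: Doc[], projectId: string): Doc[] {
  return docs.filter((d) => d.projectIds.includes(projectId));
}

// Moving a conversation only changes its project reference.
function moveConversation(conv: Conversation, target: string | null): Conversation {
  return { ...conv, projectId: target };
}

// Assumption: a non-empty project prompt overrides the global prompt.
function effectiveSystemPrompt(globalPrompt: string, project: Project | null): string {
  return project && project.systemPrompt.trim() !== ""
    ? project.systemPrompt
    : globalPrompt;
}
```

In a D1-backed version, `projectIds` would more likely be a join table than an array column; the array form is used here only to keep the sketch self-contained.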

Image/Video Generation

  • Image models: Google Nano Banana Pro, GPT Image 1.5, FLUX.2 Klein, etc. (FLUX.2 supports 4 reference images)
  • Video models: Google Veo 3.1, ByteDance Seedance 2.0, MiniMax Hailuo 2.3, etc. (run via Workflows)
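
Video generation can take minutes, which is why it is routed through Workflows rather than a single request. The sketch below shows the general submit-then-poll pattern such a workflow might follow. It is an assumption-laden illustration: `StepLike` is a hypothetical structural stand-in that mirrors the `do` and `sleep` methods of a Workflows step, `VideoBackend` and its methods are invented, and the 10-second interval and 30-poll limit are arbitrary.

```typescript
// Hypothetical stand-in for a Workflows step object.
interface StepLike {
  do<T>(name: string, fn: () => Promise<T>): Promise<T>;
  sleep(name: string, duration: string): Promise<void>;
}

// Hypothetical wrapper around whichever video model is being called.
interface VideoBackend {
  submit(prompt: string): Promise<string>; // returns a job id
  poll(jobId: string): Promise<{ done: boolean; url?: string }>;
}

// Submits a generation job, then polls until it finishes or the limit is hit.
// Each call goes through step.do under a unique, deterministic name so that
// a durable runtime can retry or resume individual steps.
async function generateVideo(
  step: StepLike,
  backend: VideoBackend,
  prompt: string,
  maxPolls = 30,
): Promise<string> {
  const jobId = await step.do("submit job", () => backend.submit(prompt));
  for (let i = 0; i < maxPolls; i++) {
    const status = await step.do(`poll ${i}`, () => backend.poll(jobId));
    if (status.done && status.url) return status.url;
    await step.sleep(`wait ${i}`, "10 seconds");
  }
  throw new Error(`Job ${jobId} did not finish after ${maxPolls} polls`);
}
```

Because each wait is a named sleep step rather than a timer held open in memory, no request has to stay alive while the model renders.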

UI Design

  • Focus mode: single-column centered chat, floating input
  • Slide-in sidebar (history, projects, docs)
  • Searchable model selector (v0.111+)

Section 05

Security & Privacy Measures

  • Cloudflare Access: Protects the entire Worker URL.
  • User Isolation: Uses Cf-Access-Authenticated-User-Email to isolate conversation history per user.
  • R2 Privacy: Each R2 object carries customMetadata.user_email, so cross-user access is blocked even if an object's UUID is guessed.
  • Video Optimization: For visual models, the client extracts 8 keyframes and uploads those instead of the full video.
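
The isolation rules above reduce to two checks: identify the caller from the Access header, then compare that identity with the owner recorded on the object. A minimal sketch follows. The helper names are hypothetical, `HeaderSource` and `StoredObject` are structural stand-ins for a request's headers and an R2 object, and lowercasing emails before comparison is an assumption rather than something the source states.

```typescript
// Structural stand-ins so the sketch runs outside a Worker.
interface HeaderSource {
  get(name: string): string | null;
}

interface StoredObject {
  customMetadata?: Record<string, string>;
}

const ACCESS_EMAIL_HEADER = "Cf-Access-Authenticated-User-Email";

// Returns the normalized email set by Cloudflare Access, or null if absent.
function currentUser(headers: HeaderSource): string | null {
  const email = headers.get(ACCESS_EMAIL_HEADER);
  return email ? email.trim().toLowerCase() : null;
}

// Allows a read only when the object's recorded owner matches the caller.
// Missing objects, missing metadata, and anonymous callers are all denied.
function canRead(obj: StoredObject | null, user: string | null): boolean {
  if (!obj || !user) return false;
  return obj.customMetadata?.user_email?.toLowerCase() === user;
}
```

A handler would fetch the object, call `canRead(obj, currentUser(request.headers))`, and return a 404 or 403 on `false`. The header is only trustworthy when every request actually passes through Access, which is what protecting the entire Worker URL provides.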

Section 06

Practical Application Value

  1. Cost-Effective: Can run on Cloudflare's free tier while still serving a meaningful volume of AI requests.
  2. Simplified Deployment: Deploys as a single Worker, with no server cluster to manage.
  3. Multimodal Unification: One interface for text, image, audio, video.
  4. Scalable: Low-latency access via Cloudflare's global network.
  5. Privacy-Focused: Built-in user isolation and access control.

Section 07

Conclusion & Developer Takeaways

SkyPhusion LLM is an open-source project that uses Cloudflare's ecosystem to build a full-featured AI playground on a single Worker. It is a useful case study for developers who want to build edge-based multimodal AI apps, since it demonstrates unified interface design, multi-provider integration, a RAG implementation, and long-task handling via Cloudflare Workflows.