Zing Forum

Gen-Smith: A Unified Multimodal AI Experiment Platform - One-Stop Experience for Image Generation and Speech Synthesis

This article introduces the Gen-Smith project, a multimodal model experiment platform built on Azure AI Foundry. It provides an intuitive web interface to experience features like GPT image generation, FLUX series models, and text-to-speech, helping developers and creators quickly explore the boundaries of generative AI capabilities.

Multimodal AI, Image Generation, Text-to-Speech, Azure AI Foundry, GPT Image, FLUX, Next.js, Generative AI
Published 2026-04-04 13:56 · Recent activity 2026-04-04 14:20 · Estimated read 5 min

Section 01

Introduction

Gen-Smith is a multimodal model experiment platform built on Azure AI Foundry. Its web interface exposes GPT image generation, the FLUX series models, and text-to-speech, so developers and creators can quickly try out what current generative AI models can do.

Section 02

Project Overview

Gen-Smith is a lightweight multimodal AI experiment platform built on Azure AI Foundry. It is designed to simplify access to multimodal models, so developers can start experimenting quickly without first learning the underlying details of each model.

The project supports the following core features:

  • Multi-model image generation (GPT Image, MAI Image, FLUX series)
  • Text-to-speech synthesis (TTS)
  • Image editing and inpainting (local redrawing)
  • Generation history management

Section 03

1. Multi-Model Image Generation

Gen-Smith's defining feature is its support for multiple image generation models, each with a dedicated experiment page:

GPT Image Series

Supports GPT Image 1.5, GPT Image 1, and GPT Image 1 Mini. These models excel in image quality and prompt comprehension, making them suitable for scenarios that require high-quality output.

MAI Image

MAI-Image-2 is Microsoft's image generation model, which is particularly strong at certain image styles.

FLUX Series

Supports models like FLUX.2-pro and FLUX.2-flex. FLUX is known for its excellent image quality and diverse styles, making it a popular choice among professional creators.

Each model has an independent configuration page, allowing developers to compare the performance differences of different models under the same prompt.
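
The side-by-side comparison described above can be sketched as a small helper that fans one prompt out into one job per model. This is a minimal sketch, not Gen-Smith's actual code: the model identifier strings, the `ImageJob` shape, and the `buildComparisonJobs` function are all hypothetical, and real Azure AI Foundry deployment names may differ.

```typescript
// Hypothetical model identifiers, derived from the model names in the article.
// Real Azure AI Foundry deployment names may differ.
type ImageModel =
  | "gpt-image-1.5"
  | "gpt-image-1"
  | "gpt-image-1-mini"
  | "MAI-Image-2"
  | "FLUX.2-pro"
  | "FLUX.2-flex";

// One generation job: the same prompt and size, sent to a specific model.
interface ImageJob {
  model: ImageModel;
  prompt: string;
  size: string;
}

// Build one job per model so every model sees exactly the same prompt.
function buildComparisonJobs(
  prompt: string,
  models: ImageModel[],
  size: string = "1024x1024"
): ImageJob[] {
  const trimmed = prompt.trim();
  if (trimmed.length === 0) {
    throw new Error("prompt must not be empty");
  }
  return models.map((model) => ({ model, prompt: trimmed, size }));
}

// Usage: three jobs that differ only in the target model.
const demoJobs = buildComparisonJobs("a lighthouse at dusk, oil painting", [
  "gpt-image-1",
  "MAI-Image-2",
  "FLUX.2-pro",
]);
console.log(demoJobs.map((job) => job.model).join(", "));
```

Keeping the prompt and size fixed while varying only the model is what makes the outputs comparable.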

Section 04

2. Text-to-Speech (TTS)

The project integrates the gpt-4o-mini-tts model, supporting the conversion of text into natural and fluent speech. Users can adjust voice style and tone parameters through the interface to find the most suitable voice effect for their needs.
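
As a rough illustration of how the voice and tone settings might map onto a request, the sketch below builds a request body in the style of the OpenAI speech endpoint, where `voice` selects a preset voice and `instructions` carries free-text tone guidance. The `buildSpeechRequest` helper and the mapping of UI controls to these fields are assumptions; the article does not show how Gen-Smith builds its requests.

```typescript
// Request body modeled on the OpenAI-style speech endpoint.
// Whether Gen-Smith sends exactly these fields is an assumption.
interface SpeechRequest {
  model: string;
  input: string;
  voice: string;
  instructions?: string;
  response_format: "mp3" | "wav";
}

// Hypothetical helper: turn UI settings into a request body.
function buildSpeechRequest(
  text: string,
  voice: string,
  instructions?: string
): SpeechRequest {
  const input = text.trim();
  if (input.length === 0) {
    throw new Error("text must not be empty");
  }
  const request: SpeechRequest = {
    model: "gpt-4o-mini-tts",
    input,
    voice,
    response_format: "mp3",
  };
  // Only attach tone guidance when the user actually provided some.
  if (instructions !== undefined && instructions.trim().length > 0) {
    request.instructions = instructions.trim();
  }
  return request;
}

// Usage: a calm narration voice.
const demoSpeech = buildSpeechRequest(
  "Welcome to Gen-Smith.",
  "alloy",
  "Speak slowly, in a calm and friendly tone."
);
console.log(JSON.stringify(demoSpeech, null, 2));
```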

Section 05

3. Image Editing Features

Gen-Smith provides a canvas-based mask editor that supports local image editing (inpainting). Users can upload an image, draw a mask on the area that needs modification, then enter a new description to generate the locally modified result. This feature is very useful for image refinement and creative exploration.
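
The mask step can be illustrated with a pure function that turns brush strokes into RGBA mask pixels. This is a sketch under stated assumptions: the `BrushStroke` shape and `buildMaskRgba` are hypothetical, and the convention that fully transparent pixels mark the region to regenerate follows the OpenAI-style image edit API rather than anything the article confirms about Gen-Smith.

```typescript
// A circular brush stamp in pixel coordinates (hypothetical shape).
interface BrushStroke {
  x: number;
  y: number;
  radius: number;
}

// Build RGBA mask pixels: painted pixels get alpha 0 (editable),
// everything else gets alpha 255 (preserved).
function buildMaskRgba(
  width: number,
  height: number,
  strokes: BrushStroke[]
): Uint8ClampedArray {
  const rgba = new Uint8ClampedArray(width * height * 4); // RGB default to 0
  for (let py = 0; py < height; py++) {
    for (let px = 0; px < width; px++) {
      const painted = strokes.some((s) => {
        const dx = px - s.x;
        const dy = py - s.y;
        return dx * dx + dy * dy <= s.radius * s.radius;
      });
      const alphaIndex = (py * width + px) * 4 + 3;
      rgba[alphaIndex] = painted ? 0 : 255;
    }
  }
  return rgba;
}

// Usage: an 8x8 mask with a single stamp near the center.
const demoMask = buildMaskRgba(8, 8, [{ x: 4, y: 4, radius: 2 }]);
console.log(demoMask.length); // 8 * 8 * 4 values
```

In a browser, an array like this could be wrapped in an `ImageData` object and drawn to a canvas before export as a PNG.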

Section 06

4. Generation History Management

All generated content is recorded along with its metadata and thumbnails. Users can review earlier experiment results, compare the effects of different parameter settings, or batch-download the generated content.
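
One way to picture the history records is as entries that pair each output with the metadata needed to compare runs. The sketch below is illustrative only: the `HistoryEntry` fields and the `groupByPrompt` helper are hypothetical, since the article does not describe Gen-Smith's storage format.

```typescript
// Hypothetical shape of one history record.
interface HistoryEntry {
  id: string;
  model: string;
  prompt: string;
  createdAt: string; // ISO 8601 timestamp
  thumbnailUrl: string;
  params: Record<string, string | number>;
}

// Group entries by normalized prompt so runs of the same prompt
// (with different models or parameters) can be compared together.
function groupByPrompt(
  entries: HistoryEntry[]
): Record<string, HistoryEntry[]> {
  const groups: Record<string, HistoryEntry[]> = {};
  for (const entry of entries) {
    const key = entry.prompt.trim().toLowerCase();
    const bucket = groups[key] ?? [];
    bucket.push(entry);
    groups[key] = bucket;
  }
  return groups;
}

// Usage: two runs of the same prompt end up in one group.
const demoHistory: HistoryEntry[] = [
  { id: "a1", model: "gpt-image-1", prompt: "A red fox", createdAt: "2026-04-04T13:56:00Z", thumbnailUrl: "/thumbs/a1.png", params: { size: "1024x1024" } },
  { id: "a2", model: "FLUX.2-pro", prompt: "a red fox ", createdAt: "2026-04-04T13:58:00Z", thumbnailUrl: "/thumbs/a2.png", params: { size: "1024x1024" } },
];
console.log(Object.keys(groupByPrompt(demoHistory)));
```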

Section 07

Technical Architecture

Gen-Smith uses a modern web technology stack:

Section 08

Frontend Technology

  • Next.js 15: Uses App Router architecture, supporting server-side rendering and client-side interaction
  • React 19: Provides a smooth user interface experience
  • TypeScript: Ensures type safety and maintainability of the code
  • Tailwind CSS: Enables rapid style development and responsive layout
  • Radix UI: Provides accessible basic components