Multimodal LLM Inference Service Based on Clean Architecture: FastAPI Implementation of Qwen3.5-2B

Using Clean Architecture, this project provides a multimodal Qwen3.5-2B vision-language model inference service on CPU via FastAPI and llama.cpp, supporting SQLite persistence and a complete REST API.

Tags: FastAPI · Clean Architecture · Qwen3.5 · Multimodal · llama.cpp · GGUF · Vision-Language Model · REST API
Published 2026-04-28 20:13 · Last activity 2026-04-28 20:27 · Estimated read: 6 min

Section 01

Introduction

This project is a multimodal large language model (LLM) inference service built on Clean Architecture design principles. It exposes a REST API through the FastAPI framework and runs the quantized Qwen3.5-2B vision-language model on CPU via llama.cpp. Key features include strict architectural layering, production-grade engineering practices (request persistence, image storage, rate limiting), and multimodal inference, making it a high-quality reference implementation for deploying multimodal LLMs in production environments.


Section 02

Project Background and Architectural Philosophy

Project Overview

This project aims to build a production-ready multimodal LLM inference service, with Clean Architecture design principles at its core to ensure architectural purity and maintainability.

Clean Architecture Practices

  • Layered Dependency Rule: Adopts a four-layer structure (Presentation/Application/Domain/Data), where each layer depends only on layers inside it. The presentation layer uses FastAPI routes and Pydantic validation; the application layer contains business logic and DTOs; the domain layer defines ORM entities and repository interfaces; the data layer handles SQLite and file-system operations.
  • Boundary Control: Enforces clear layer boundaries through dependency injection and code review. Endpoints depend only on services; services depend on other services and repositories; direct imports of ORM models are avoided, achieving separation of concerns (see the sketch below).
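
To make the dependency rule concrete, here is a minimal sketch of how the four layers might be wired together with FastAPI's dependency injection. All class and function names are illustrative, not the project's actual identifiers, and the data layer is stubbed with an in-memory repository in place of SQLite.

```python
# Minimal layering sketch; names are hypothetical, not the project's own.
import uuid
from typing import Protocol

from fastapi import APIRouter, Depends
from pydantic import BaseModel


# Domain layer: repository interface only -- no FastAPI or SQLite imports.
class InferenceRepository(Protocol):
    def save(self, prompt: str, response: str) -> str: ...


# Data layer: a concrete repository (in-memory stand-in for SQLite here).
class InMemoryInferenceRepository:
    def __init__(self) -> None:
        self._rows: dict[str, tuple[str, str]] = {}

    def save(self, prompt: str, response: str) -> str:
        record_id = str(uuid.uuid4())
        self._rows[record_id] = (prompt, response)
        return record_id


# Application layer: business logic depends only on the interface.
class InferenceService:
    def __init__(self, repo: InferenceRepository) -> None:
        self._repo = repo

    def run(self, prompt: str) -> str:
        response = f"(model output for: {prompt!r})"  # placeholder for the llama.cpp call
        return self._repo.save(prompt, response)


# Presentation layer: route + Pydantic validation, depends only on the service.
class InferenceRequest(BaseModel):
    prompt: str


router = APIRouter()
_repo = InMemoryInferenceRepository()


def get_service() -> InferenceService:
    return InferenceService(_repo)


@router.post("/inferences")
def create_inference(body: InferenceRequest,
                     svc: InferenceService = Depends(get_service)) -> dict:
    return {"id": svc.run(body.prompt)}
```

Note how the import direction matches the dependency rule: the route module imports the service, the service imports only the repository interface, and the concrete repository can be swapped (in-memory, SQLite) without touching the outer layers.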

Section 03

Enterprise-Grade Features and Sampling Strategy Details

Complete REST API

Provides four main endpoints: POST /inferences (create an inference; supports multipart upload), GET /inferences (paginated query of history), GET /inferences/{id} (record details), and GET /inferences/{id}/image (original image).
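
As an illustration, the four endpoints might be exercised from Python as follows. The base URL, form-field names, and pagination parameters are assumptions, not documented values.

```python
# Hypothetical client calls against the four endpoints described above.
import requests

BASE = "http://localhost:8000"  # assumed local deployment

# POST /inferences -- multipart upload with a prompt and an image
with open("cat.png", "rb") as f:
    created = requests.post(
        f"{BASE}/inferences",
        data={"prompt": "Describe this image."},
        files={"image": ("cat.png", f, "image/png")},
    ).json()

# GET /inferences -- paginated query of history
history = requests.get(f"{BASE}/inferences", params={"page": 1, "size": 20}).json()

# GET /inferences/{id} -- details of a single record
detail = requests.get(f"{BASE}/inferences/{created['id']}").json()

# GET /inferences/{id}/image -- raw bytes of the original upload
image_bytes = requests.get(f"{BASE}/inferences/{created['id']}/image").content
```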

Data Persistence

Inference records (prompts, responses, parameters, etc.) are stored in SQLite; uploaded images are stored in the file system and associated via UUID, supporting audit tracking and result backtracking.
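
A sketch of what this persistence path could look like, assuming a single SQLite table and a UUID-named image file; the schema and helper names are illustrative, not taken from the project.

```python
# Hypothetical persistence helper: SQLite row + UUID-named image file.
import pathlib
import sqlite3
import uuid

IMAGE_DIR = pathlib.Path("images")


def init_db(db: sqlite3.Connection) -> None:
    db.execute(
        "CREATE TABLE IF NOT EXISTS inferences ("
        "  id TEXT PRIMARY KEY, prompt TEXT, response TEXT,"
        "  params TEXT, image_path TEXT)"
    )


def save_record(db: sqlite3.Connection, prompt: str, response: str,
                params_json: str, image_bytes: bytes | None) -> str:
    record_id = str(uuid.uuid4())  # the UUID ties the DB row to the image file
    image_path = None
    if image_bytes is not None:
        IMAGE_DIR.mkdir(exist_ok=True)
        image_path = str(IMAGE_DIR / f"{record_id}.png")
        pathlib.Path(image_path).write_bytes(image_bytes)
    db.execute(
        "INSERT INTO inferences VALUES (?, ?, ?, ?, ?)",
        (record_id, prompt, response, params_json, image_path),
    )
    db.commit()
    return record_id
```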

Security Protection

Built-in rate limiting (30 requests/minute by default), image limits (max 10 MB; longest edge 768 pixels), and a MIME type whitelist (PNG/JPEG/WebP).
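
The upload guards might look roughly like this. Pillow is assumed for image handling, the limits mirror the defaults quoted above, and whether oversized images are rejected or downscaled is not specified in the source; this sketch downscales.

```python
# Sketch of the upload guards: MIME whitelist, 10 MB cap, 768 px longest edge.
from io import BytesIO

from fastapi import HTTPException, UploadFile
from PIL import Image

ALLOWED_MIME = {"image/png", "image/jpeg", "image/webp"}
MAX_BYTES = 10 * 1024 * 1024  # 10 MB
MAX_EDGE = 768                # pixels, longest edge


async def validate_image(upload: UploadFile) -> Image.Image:
    if upload.content_type not in ALLOWED_MIME:
        raise HTTPException(status_code=415, detail="unsupported image type")
    data = await upload.read()
    if len(data) > MAX_BYTES:
        raise HTTPException(status_code=413, detail="image exceeds 10 MB")
    image = Image.open(BytesIO(data))
    image.thumbnail((MAX_EDGE, MAX_EDGE))  # downscale so the longest edge <= 768
    return image
```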

Sampling Strategy

  • Text mode: Temperature=1.0 (moderate creativity), Top-p=1.0 (no truncation), Presence penalty=2.0 (strong repetition penalty)
  • Vision-language mode: Temperature=0.7 (low randomness), Top-p=0.8 (moderate truncation), Presence penalty=1.5 (moderate penalty); both presets are sketched below
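
Both presets map directly onto llama-cpp-python's OpenAI-style chat API. The model path and context size below are placeholders, not the project's actual configuration.

```python
# Passing the two sampling presets to llama-cpp-python (values as quoted above).
from llama_cpp import Llama

llm = Llama(model_path="qwen3.5-2b-q4.gguf", n_ctx=8192)  # placeholder path/size

TEXT_PRESET = {"temperature": 1.0, "top_p": 1.0, "presence_penalty": 2.0}
VISION_PRESET = {"temperature": 0.7, "top_p": 0.8, "presence_penalty": 1.5}


def complete(prompt: str, vision: bool = False) -> str:
    preset = VISION_PRESET if vision else TEXT_PRESET
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        **preset,
    )
    return out["choices"][0]["message"]["content"]
```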

Section 04

Deployment Configuration and Scalability Support

Environment Configuration

All settings can be overridden via environment variables or a .env file, including model selection, context window, thread count, GPU layer count, and more, adapting the service to development and production environments.
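
One common way to implement this is with pydantic-settings. The variable names below follow the ones mentioned in this article (MODEL_N_CTX, MODEL_N_GPU_LAYERS); the defaults and remaining fields are assumptions.

```python
# Hypothetical settings module: every field can be overridden via env vars or .env.
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    # protected_namespaces=() silences pydantic's warning about "model_" fields
    model_config = SettingsConfigDict(env_file=".env", protected_namespaces=())

    model_path: str = "qwen3.5-2b-q4.gguf"  # model selection
    model_n_ctx: int = 8192                 # context window  -> MODEL_N_CTX
    model_n_threads: int = 8                # CPU threads     -> MODEL_N_THREADS
    model_n_gpu_layers: int = 0             # 0 = pure CPU    -> MODEL_N_GPU_LAYERS


settings = Settings()  # e.g. MODEL_N_CTX=32768 in .env overrides the default
```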

GPU Acceleration

Runs on CPU by default; provides guidelines for enabling CUDA/Vulkan backends (reinstall llama-cpp-python with the appropriate CMAKE_ARGS) and accelerates inference by controlling how many layers are offloaded to the GPU via MODEL_N_GPU_LAYERS.
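
For reference, the llama-cpp-python README describes rebuilding the wheel with GPU support via CMAKE_ARGS (the exact flag varies by version and backend); after that, offloading is a single constructor argument. The model path below is a placeholder.

```python
# Offloading transformer layers to the GPU; requires a GPU-enabled build of
# llama-cpp-python, e.g. (flag names vary by version/backend, see its README):
#   CMAKE_ARGS="-DGGML_CUDA=on" pip install --force-reinstall --no-cache-dir llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3.5-2b-q4.gguf",  # placeholder path
    n_gpu_layers=-1,                  # -1 offloads all layers; 0 = pure CPU
)
```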

Model Context Optimization

Qwen3.5-2B uses a hybrid architecture of Gated DeltaNet and sparse Gated Attention, so KV-cache growth is nearly constant and increasing MODEL_N_CTX does not cause quadratic memory expansion.


Section 05

Engineering Practice Value and Reference Significance

This project demonstrates the full process of encapsulating an LLM into a production-ready service: from architectural design to deployment configuration, from error handling to performance optimization, every detail is carefully considered. For teams looking to deploy multimodal LLMs on their own infrastructure, this project provides a high-quality reference implementation that can be directly reused or used as a basis for custom development.