# Multimodal LLM Inference Service Based on Clean Architecture: FastAPI Implementation of Qwen3.5-2B

> Using Clean Architecture, this project provides a multimodal Qwen3.5-2B vision-language model inference service on CPU via FastAPI and llama.cpp, supporting SQLite persistence and a complete REST API.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-28T12:13:37.000Z
- Last activity: 2026-04-28T12:27:28.239Z
- Hotness: 132.8
- Keywords: FastAPI, Clean Architecture, Qwen3.5, multimodal, llama.cpp, GGUF, vision-language model, REST API
- Page link: https://www.zingnex.cn/en/forum/thread/clean-architecturellm-qwen3-5-2bfastapi
- Canonical: https://www.zingnex.cn/forum/thread/clean-architecturellm-qwen3-5-2bfastapi
- Markdown source: floors_fallback

---

## Introduction

This project is a multimodal large language model (LLM) inference service based on Clean Architecture design principles. It provides REST API interfaces via the FastAPI framework and runs the quantized Qwen3.5-2B vision-language model on CPU using llama.cpp. Key features include strict architectural layering, complete engineering practices (such as request persistence, image storage, rate limiting), and multimodal inference capabilities, offering a high-quality reference implementation for deploying multimodal LLMs in production environments.

## Project Background and Architectural Philosophy

### Project Overview
This project aims to build a production-ready multimodal LLM inference service, strictly following Clean Architecture design principles to ensure architectural purity and maintainability.

### Clean Architecture Practices
- **Layered Dependency Rule**: Adopts a four-layer structure (Presentation Layer/Application Layer/Domain Layer/Data Layer), where each layer only depends on inner layers. The presentation layer uses FastAPI routes and Pydantic validation; the application layer contains business logic and DTOs; the domain layer defines ORM entities and repository interfaces; the data layer handles SQLite and file system operations.
- **Boundary Control**: Ensures clear layered boundaries through dependency injection and code reviews. Endpoints only depend on services; services depend on other services and repositories. Direct import of ORM models is avoided to achieve separation of concerns.

## Enterprise-Grade Features and Sampling Strategy Details

### Complete REST API
Provides four main endpoints:

- `POST /inferences`: create an inference (supports multipart upload)
- `GET /inferences`: paginated query of inference history
- `GET /inferences/{id}`: details of a single record
- `GET /inferences/{id}/image`: the original uploaded image

### Data Persistence
Inference records (prompts, responses, parameters, etc.) are stored in SQLite; uploaded images are stored in the file system and associated via UUID, supporting audit tracking and result backtracking.

### Security Protection
Built-in rate limiting (30 requests per minute by default), image size limits (max 10 MB upload, longest edge capped at 768 pixels), and a MIME type whitelist (PNG/JPEG/WebP).
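The image guards can be expressed as a pair of small helpers. The constant values mirror the limits stated above; the function names and signatures are hypothetical.

```python
# Illustrative request-guard logic for the image limits described above.
ALLOWED_MIME = {"image/png", "image/jpeg", "image/webp"}
MAX_IMAGE_BYTES = 10 * 1024 * 1024   # 10 MB upload cap
MAX_EDGE_PIXELS = 768                # longest edge after downscaling


def validate_image(mime_type: str, size_bytes: int) -> None:
    """Reject uploads that fail the MIME whitelist or the size cap."""
    if mime_type not in ALLOWED_MIME:
        raise ValueError(f"unsupported image type: {mime_type}")
    if size_bytes > MAX_IMAGE_BYTES:
        raise ValueError("image exceeds the 10 MB limit")


def target_size(width: int, height: int) -> tuple[int, int]:
    """Downscale so the longest edge is at most MAX_EDGE_PIXELS, keeping aspect ratio."""
    longest = max(width, height)
    if longest <= MAX_EDGE_PIXELS:
        return width, height
    scale = MAX_EDGE_PIXELS / longest
    return round(width * scale), round(height * scale)
```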

### Sampling Strategy
- Text mode: Temperature=1.0 (moderate creativity), Top-p=1.0 (no truncation), Presence penalty=2.0 (strong repetition penalty)
- Vision-language mode: Temperature=0.7 (low randomness), Top-p=0.8 (moderate truncation), Presence penalty=1.5 (moderate penalty)
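The two presets above can be expressed as llama-cpp-python-style keyword arguments; the dict layout and the `sampling_params` helper are assumptions about how the service might pass them along, but the numeric values are taken from the post.

```python
# The two sampling presets, keyed by request mode.
SAMPLING_PRESETS = {
    "text": {
        "temperature": 1.0,       # moderate creativity
        "top_p": 1.0,             # no nucleus truncation
        "presence_penalty": 2.0,  # strong repetition penalty
    },
    "vision": {
        "temperature": 0.7,       # lower randomness for grounded descriptions
        "top_p": 0.8,             # moderate nucleus truncation
        "presence_penalty": 1.5,  # moderate repetition penalty
    },
}


def sampling_params(has_image: bool) -> dict:
    """Pick the preset based on whether the request carries an image."""
    return SAMPLING_PRESETS["vision" if has_image else "text"]
```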

## Deployment Configuration and Scalability Support

### Environment Configuration
All settings can be overridden via environment variables or a .env file, including model selection, context window size, thread count, and the number of GPU layers, to suit both development and production environments.
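A standard-library sketch of such environment-driven settings is shown below. The variable names `MODEL_N_CTX` and `MODEL_N_GPU_LAYERS` appear in the post; `MODEL_PATH`, `MODEL_N_THREADS`, and all default values are illustrative assumptions.

```python
# Minimal environment-driven settings loader (no third-party dependencies).
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    model_path: str
    n_ctx: int          # context window size
    n_threads: int      # CPU threads for llama.cpp
    n_gpu_layers: int   # layers offloaded to the GPU (0 = CPU only)


def load_settings(env=os.environ) -> Settings:
    return Settings(
        model_path=env.get("MODEL_PATH", "models/qwen3.5-2b-q4_k_m.gguf"),
        n_ctx=int(env.get("MODEL_N_CTX", "8192")),
        n_threads=int(env.get("MODEL_N_THREADS", "4")),
        n_gpu_layers=int(env.get("MODEL_N_GPU_LAYERS", "0")),
    )
```

Passing a plain dict in place of `os.environ` keeps the loader trivially testable per environment.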

### GPU Acceleration
Runs on CPU by default; the project provides guidelines for enabling the CUDA or Vulkan backend (reinstalling llama-cpp-python with the appropriate CMAKE_ARGS set) and accelerating inference by controlling how many layers are offloaded to the GPU via MODEL_N_GPU_LAYERS.
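The reinstall step might look like the following; the exact CMake flags vary between llama-cpp-python releases, so check the version's own documentation before copying these.

```shell
# CUDA backend (flag name assumes a recent llama-cpp-python release)
CMAKE_ARGS="-DGGML_CUDA=on" pip install --force-reinstall --no-cache-dir llama-cpp-python

# Vulkan backend
CMAKE_ARGS="-DGGML_VULKAN=on" pip install --force-reinstall --no-cache-dir llama-cpp-python

# Then offload layers to the GPU via the service's environment;
# -1 would offload all layers.
export MODEL_N_GPU_LAYERS=20
```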

### Model Context Optimization
Qwen3.5-2B uses a hybrid architecture of Gated DeltaNet and sparse Gated Attention, so its KV-cache footprint stays nearly constant as the context grows; increasing MODEL_N_CTX therefore does not cause the steep memory growth that a standard full-attention KV cache would.

## Engineering Practice Value and Reference Significance

This project demonstrates the full process of encapsulating an LLM into a production-ready service: from architectural design to deployment configuration, from error handling to performance optimization, every detail is carefully considered. For teams looking to deploy multimodal LLMs on their own infrastructure, this project provides a high-quality reference implementation that can be directly reused or used as a basis for custom development.
