Section 01
Multimodal LLM Inference Service Based on Clean Architecture: FastAPI Implementation of Qwen3.5-2B (Introduction)
This project is a multimodal large language model (LLM) inference service designed around Clean Architecture principles. It exposes REST APIs through the FastAPI framework and runs a quantized Qwen3.5-2B vision-language model on CPU via llama.cpp. Key features include strict architectural layering, production-oriented engineering practices (request persistence, image storage, and rate limiting), and multimodal inference, making it a reference implementation for deploying multimodal LLMs in production environments.
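As a minimal sketch of the layering described above (all class and function names here are illustrative, not taken from the project), Clean Architecture separates the domain entities, the use case, and the infrastructure adapter so that business rules never depend on FastAPI or llama.cpp directly:

```python
from dataclasses import dataclass
from typing import Optional, Protocol


# Domain layer: plain entities with no framework dependencies.
@dataclass
class InferenceRequest:
    prompt: str
    image_path: Optional[str] = None  # optional image for multimodal input


@dataclass
class InferenceResult:
    text: str


# Use-case layer: depends only on an abstract gateway (dependency inversion).
class LlmGateway(Protocol):
    def generate(self, request: InferenceRequest) -> InferenceResult: ...


class RunInference:
    """Application use case; validation and persistence hooks would live here."""

    def __init__(self, gateway: LlmGateway) -> None:
        self._gateway = gateway

    def execute(self, request: InferenceRequest) -> InferenceResult:
        if not request.prompt.strip():
            raise ValueError("prompt must not be empty")
        return self._gateway.generate(request)


# Infrastructure layer: a stub adapter; in the real service this class
# would wrap a llama.cpp binding loading the quantized Qwen model.
class StubLlamaCppGateway:
    def generate(self, request: InferenceRequest) -> InferenceResult:
        return InferenceResult(text=f"echo: {request.prompt}")


if __name__ == "__main__":
    use_case = RunInference(StubLlamaCppGateway())
    print(use_case.execute(InferenceRequest(prompt="describe the image")).text)
```

A FastAPI route handler would then construct an `InferenceRequest` from the HTTP payload and call `RunInference.execute`, keeping the web framework confined to the outermost layer.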