Zing Forum

Kifayati AI: A Cost-Optimized Intelligent Routing System via Hybrid Multi-Model Architecture

An open-source project demonstrating how to assign simple queries to lightweight models and complex queries to powerful models via intelligent routing, reducing AI inference costs by up to 90% while maintaining performance.

Tags: LLM, cost optimization, intelligent routing, Gemma, Gemini, hybrid models, FinOps, Kubernetes, open-source project
Published 2026-05-20 01:48 | Recent activity 2026-05-20 02:17 | Estimated read 8 min

Section 01

[Introduction] Kifayati AI: Cost Optimization via Hybrid Multi-Model Intelligent Routing

Kifayati AI is an open-source project by Google Developer Expert Geeta Kakrani. Its core is an intelligent routing mechanism that assigns simple queries to lightweight models (e.g., Gemma 3:4b) and complex queries to powerful models (e.g., Gemini 2.5 Flash), reducing AI inference costs by up to 90% while maintaining performance. The project name "Kifayati" (meaning "thrifty" in Hindi) reflects its philosophy of seeking the optimal balance between performance and cost.

Section 02

Project Background: Why Do We Need a Hybrid Model Architecture?

Generative AI applications today often send every request to the most powerful model available. This one-size-fits-all approach causes three major problems: 1. Unsustainable API costs (simple queries incur disproportionately high fees at scale); 2. Unnecessary latency (simple queries wait on large-model inference); 3. Wasted computing resources (tasks that need no advanced capabilities still occupy GPU resources). Kifayati AI's solution is an intelligent routing system that evaluates each query's complexity before assigning it to the appropriate model.

Section 03

Core Architecture: Five-Signal Complexity Scoring Engine

The core of Kifayati AI is the QueryEvaluator complexity scoring engine, which computes a complexity score between 0 and 1 from five signals:

1. Token count: longer queries are treated as more complex.
2. Complex keyword detection: technical terms, specialist concepts, and similar vocabulary.
3. Reasoning depth: whether the query needs multi-step logical deduction.
4. Code detection: whether the query is programming-related.
5. Simple query penalty: greetings and similar trivial inputs score lower.

Queries scoring below 0.4 are routed to Gemma 3:4b (approximately $0.00001 per request on Vertex AI); queries scoring 0.4 or higher are routed to Gemini 2.5 Flash (approximately $0.0001 per request).
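The five signals and the 0.4 threshold can be sketched roughly as follows. This is an illustrative approximation, not the project's actual QueryEvaluator: the keyword lists, regular expressions, per-signal weights, and the names `evaluate` and `RoutingDecision` are all assumptions made for the example, and word count stands in for a real token count.

```python
import re
from dataclasses import dataclass, field

# All keyword lists and weights below are illustrative assumptions,
# not values taken from the Kifayati AI source code.
COMPLEX_KEYWORDS = {"algorithm", "architecture", "distributed", "kubernetes", "optimize", "latency"}
REASONING_CUES = ("explain", "why", "compare", "step by step", "analyze")
GREETING = re.compile(r"^\s*(hi|hello|hey|thanks|thank you)\b", re.IGNORECASE)
CODE = re.compile(r"```|\bdef \w+\(|\bimport \w+|\bfunction \w+\(|#include")

THRESHOLD = 0.4  # routing threshold stated in the article


@dataclass
class RoutingDecision:
    score: float
    model: str
    reasons: list = field(default_factory=list)


def evaluate(query: str) -> RoutingDecision:
    text = query.lower()
    words = re.findall(r"[a-z0-9']+", text)
    reasons = []

    # Signal 1: token count (word count as a rough proxy), saturating at 100 words.
    score = 0.2 * min(len(words) / 100, 1.0)

    # Signal 2: complex keywords; two or more hits give the full weight.
    keyword_hits = len(COMPLEX_KEYWORDS & set(words))
    if keyword_hits:
        score += 0.25 * min(keyword_hits / 2, 1.0)
        reasons.append(f"{keyword_hits} complex keyword(s)")

    # Signal 3: reasoning depth, approximated by cue phrases.
    cue_hits = sum(cue in text for cue in REASONING_CUES)
    if cue_hits:
        score += 0.25 * min(cue_hits / 2, 1.0)
        reasons.append(f"{cue_hits} reasoning cue(s)")

    # Signal 4: code detection.
    if CODE.search(query):
        score += 0.3
        reasons.append("code detected")

    # Signal 5: simple-query penalty for greetings.
    if GREETING.search(query):
        score -= 0.3
        reasons.append("greeting penalty")

    score = max(0.0, min(1.0, score))
    model = "Gemma 3:4b" if score < THRESHOLD else "Gemini 2.5 Flash"
    return RoutingDecision(score=score, model=model, reasons=reasons)
```

With these assumed weights, a bare greeting such as "Hello" clamps to a score of 0 and goes to Gemma, while a long question combining technical terms with reasoning cues crosses 0.4 and goes to Gemini. Returning the reasons alongside the score mirrors the routing decision reasons that the article says are shown in the monitoring interface.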

Section 04

Key Technical Features

Kifayati AI includes three key features:

1. Intelligent caching: an LRU cache stores up to 500 historical queries. Identical requests are answered directly from the cache, with no inference cost and near-zero latency, and the least recently used entries are evicted when the cache is full.
2. Circuit breaker pattern: if Gemma fails three times in a row, all traffic is automatically switched to Gemini, and recovery is attempted after 30 seconds.
3. Real-time FinOps monitoring: a Streamlit interface shows request costs, cumulative savings, latency comparisons, and the reasons behind routing decisions in real time.
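The caching and circuit breaker behaviour described above can be sketched with the standard library alone. The class and method names here (`LRUCache`, `CircuitBreaker`, `allow_primary`) are hypothetical and the project's own implementation may differ; only the 500-entry capacity, the three-failure threshold, and the 30-second recovery window come from the article.

```python
import time
from collections import OrderedDict


class LRUCache:
    """Least-recently-used cache keyed by the exact query string."""

    def __init__(self, capacity=500):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, query):
        if query not in self._store:
            return None
        self._store.move_to_end(query)  # mark as most recently used
        return self._store[query]

    def put(self, query, response):
        self._store[query] = response
        self._store.move_to_end(query)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the least recently used entry


class CircuitBreaker:
    """Opens after consecutive primary-model failures, retries after a delay."""

    def __init__(self, failure_threshold=3, recovery_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self._clock = clock  # injectable so the timing can be tested
        self._failures = 0
        self._opened_at = None

    def allow_primary(self):
        """True if the primary model (Gemma) may be tried for this request."""
        if self._opened_at is None:
            return True
        # After the recovery window, let a trial request through.
        return self._clock() - self._opened_at >= self.recovery_seconds

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open, or re-open after a failed trial
```

A router built on this sketch would check the cache first, then call `allow_primary()` before sending a low-complexity query to Gemma, falling back to Gemini whenever it returns False. Keying the cache on the exact query string matches the article's "identical requests" wording; normalizing or semantically matching queries would be a separate design decision.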

Section 05

Deployment and Scalability

Kifayati AI offers production-grade deployment solutions:

1. RESTful API backend: built on FastAPI, with endpoints for inference, health checks, metrics, circuit breaker status, and cache clearing.
2. Kubernetes-native support: GKE deployment manifests with a Horizontal Pod Autoscaler (scaling between 1 and 5 pods based on CPU load) and secure GCP access via Workload Identity.
3. CI/CD pipeline: a GitHub Actions workflow that automates deployment from code commit through Cloud Build to GKE.
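To make the 1-5 pod autoscaling range concrete, the sketch below applies the replica formula from the Kubernetes Horizontal Pod Autoscaler documentation, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), and clamps the result to the configured bounds. The 70% CPU target is an assumption, since the article does not state the project's target, and the sketch ignores the autoscaler's tolerance band and stabilization window.

```python
import math


def desired_replicas(current_replicas, current_cpu_percent,
                     target_cpu_percent=70.0,  # assumed target, not from the article
                     min_replicas=1, max_replicas=5):
    """Replica count the HPA formula would request, clamped to the 1-5 range."""
    raw = math.ceil(current_replicas * current_cpu_percent / target_cpu_percent)
    return max(min_replicas, min(max_replicas, raw))
```

For example, two pods averaging 140% of their CPU request against a 70% target would scale to four pods, while any load that calls for more than five pods is capped at the maximum.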

Section 06

Cost-Benefit Analysis (Evidence)

According to the project's benchmark test data, the cost advantage is significant:

Scenario                                     Cost per 1,000 requests
Using only Gemini (baseline)                 $0.10
Kifayati hybrid solution (70% using Gemma)   $0.037
Savings ratio                                ~63%

In practice, the savings grow with the share of simple queries: at the per-request prices quoted in Section 03, routing nearly all traffic to Gemma approaches a 90% saving. This matters most for high-volume workloads such as customer service bots and content generation platforms.
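The table's figures follow directly from the per-request prices quoted in Section 03, and the arithmetic can be checked with a short sketch. The function names are hypothetical, and the model deliberately ignores cache hits and any other costs, so it is a simplification rather than the project's benchmark method.

```python
GEMMA_COST = 0.00001   # approximate USD per request, from the article
GEMINI_COST = 0.0001   # approximate USD per request, from the article


def cost_per_thousand(gemma_share):
    """Blended cost of 1,000 requests when `gemma_share` of them go to Gemma."""
    if not 0.0 <= gemma_share <= 1.0:
        raise ValueError("gemma_share must be between 0 and 1")
    per_request = gemma_share * GEMMA_COST + (1.0 - gemma_share) * GEMINI_COST
    return per_request * 1000


def savings_ratio(gemma_share):
    """Fraction saved relative to sending every request to Gemini."""
    baseline = GEMINI_COST * 1000
    return 1.0 - cost_per_thousand(gemma_share) / baseline


if __name__ == "__main__":
    for share in (0.0, 0.5, 0.7, 0.9, 1.0):
        print(f"{share:.0%} Gemma: ${cost_per_thousand(share):.3f} per 1,000, "
              f"saving {savings_ratio(share):.0%}")
```

At a 70% Gemma share this gives 700 × $0.00001 + 300 × $0.0001 = $0.037 per 1,000 requests, a 63% saving. The 90% figure is the ceiling reached only when every request is routed to Gemma.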

Section 07

Practical Application Recommendations

Recommendations for developers implementing similar cost optimization:

1. Query classification strategy: define complexity criteria based on business needs. An e-commerce customer service system, for example, might treat return and refund policy queries as simple and technical support queries as complex.
2. Progressive deployment: enable intelligent routing on part of the traffic first, then expand coverage after observing the results.
3. Monitoring and tuning: continuously monitor routing accuracy and adjust the scoring threshold based on feedback (the project's modular design facilitates tuning).
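One common way to implement the progressive deployment step is deterministic, hash-based bucketing, so that a given user or session consistently falls inside or outside the rollout. The sketch below is an assumption about how this could be done, not something described in the project; the function name `in_rollout` and the choice of key are hypothetical.

```python
import hashlib


def in_rollout(request_key, rollout_percent):
    """Return True if this key falls inside the first `rollout_percent` of 100 buckets.

    `request_key` is any stable identifier (user ID, session ID); hashing it
    keeps the assignment consistent across requests and processes.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rollout_percent
```

Requests for which `in_rollout` returns False would keep going to the existing single-model path, giving a control group against which cost, latency, and answer quality can be compared. Because a key's bucket never changes, raising the percentage only adds traffic to the routed group and never moves anyone back out.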

Section 08

Conclusion

Kifayati AI demonstrates a feasible path to cost optimization in generative AI applications. By combining intelligent routing, caching, and the circuit breaker pattern, it significantly reduces operational costs without sacrificing user experience. As LLM applications scale, this "on-demand allocation" approach to architecture will become increasingly important. The project is both an open-source tool and a lesson in architectural design: intelligence in the AI era lies not only in model capabilities but also in how wisely resources are scheduled.