Pravāha adopts a clear layered architecture, extending from the user interface to the underlying Rust performance core:
Layer 1: Interaction Interface
Provides CLI (based on Typer), FastAPI services, WebSocket real-time communication, and a Textual-based terminal dashboard (TUI), even including pixel-style avatar animations to make the command-line experience more engaging.
Layer 2: Engine Core
AsyncPravahaEngine is the asynchronous inference core; it works with the EventBus and the RequestQueue to schedule tasks efficiently.
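The interplay of engine, event bus, and request queue can be sketched in a few lines. This is a hypothetical illustration, not Pravāha's actual code: the class names mirror the article, but the internals (an `asyncio.PriorityQueue`, a callback-based bus) are assumptions.

```python
import asyncio
from collections import defaultdict
from dataclasses import dataclass, field

class EventBus:
    """Topic -> list of handlers; publish fans out synchronously."""
    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subs[topic].append(handler)

    def publish(self, topic, payload):
        for handler in self._subs[topic]:
            handler(payload)

@dataclass(order=True)
class Request:
    priority: int                       # lower number = more urgent
    prompt: str = field(compare=False)  # excluded from ordering

class AsyncPravahaEngine:
    def __init__(self, bus):
        self.bus = bus
        self.queue = asyncio.PriorityQueue()  # stands in for RequestQueue

    async def submit(self, req):
        await self.queue.put(req)

    async def run(self, n_requests):
        for _ in range(n_requests):
            req = await self.queue.get()
            # Stand-in for the real pipeline (tokenize -> ... -> sample).
            result = req.prompt.upper()
            self.bus.publish("completed", result)

async def demo():
    bus = EventBus()
    done = []
    bus.subscribe("completed", done.append)
    engine = AsyncPravahaEngine(bus)
    await engine.submit(Request(priority=1, prompt="hello"))
    await engine.submit(Request(priority=0, prompt="urgent"))
    await engine.run(n_requests=2)
    return done

results = asyncio.run(demo())
print(results)  # the lower-priority-number request completes first
```

Decoupling producers and consumers through a bus like this is what lets the CLI, the FastAPI service, and the TUI all observe the same inference events without knowing about each other.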
Layer 3: Inference Pipeline
Requests start at the Tokenizer, pass through the Scheduler and Decoder, and end at the Sampler, forming a complete inference processing chain.
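The four stages can be sketched as composable functions. The stage names come from the article; the toy logic inside each (a whitespace tokenizer, a shortest-first scheduler, a fake decoder, greedy sampling) is an illustrative assumption:

```python
def tokenize(text):
    # Toy tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def schedule(batch):
    # A real scheduler batches and orders requests; here we simply
    # run shorter token sequences first.
    return sorted(batch, key=len)

def decode(tokens):
    # Stand-in for model forward passes: score each candidate token.
    vocab = ["yes", "no", "maybe"]
    return {tok: len(tok) + len(tokens) for tok in vocab}

def sample(logits):
    # Greedy sampling: pick the highest-scoring token.
    return max(logits, key=logits.get)

def run_pipeline(prompts):
    batch = schedule([tokenize(p) for p in prompts])
    return [sample(decode(tokens)) for tokens in batch]

print(run_pipeline(["Hello world", "Hi"]))
```

The point of the sketch is the shape, not the math: each stage takes the previous stage's output, so stages can be swapped (e.g. a different sampler) without touching the rest of the chain.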
Layer 4: Memory Plane
This is one of Pravāha's technical highlights. PagedKVCache implements paged KV cache management, BlockManager handles memory block allocation, PrefixTrie (implemented in Rust) supports prefix sharing, LRU Swapping enables intelligent page swapping, and the Preemption mechanism handles priority preemption. Together, this design targets vLLM-level memory efficiency.
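The core bookkeeping behind paged KV caching can be shown with a minimal BlockManager sketch: sequences get fixed-size pages from a free-list, and releasing a sequence returns its pages for reuse. The block size and the `MemoryError` fallback are assumptions for illustration, not Pravāha's actual policy:

```python
class BlockManager:
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # free-list of physical pages
        self.tables = {}                     # seq_id -> list of page ids

    def allocate(self, seq_id, num_tokens):
        needed = -(-num_tokens // self.block_size)  # ceil division
        if needed > len(self.free):
            # A real engine would LRU-swap pages out or preempt a
            # lower-priority sequence here.
            raise MemoryError("no free KV pages")
        pages = [self.free.pop() for _ in range(needed)]
        self.tables[seq_id] = pages
        return pages

    def release(self, seq_id):
        # Return the sequence's pages to the free-list for reuse.
        self.free.extend(self.tables.pop(seq_id))

mgr = BlockManager(num_blocks=4, block_size=16)
mgr.allocate("seq-a", 40)  # 40 tokens -> 3 pages of 16
print(len(mgr.free))       # 1 page left
mgr.release("seq-a")
print(len(mgr.free))       # all 4 pages free again
```

Because allocation is page-granular rather than contiguous, fragmentation stays bounded, which is the key idea vLLM's PagedAttention popularized.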
Layer 5: Intelligent Cluster (51 Agents)
This is the core feature that distinguishes Pravāha from other inference engines. The 51 agents are divided into four categories: 20 Execution Agents, 12 Audit Agents, 10 Security Agents, and 9 Design Agents. All of them operate on the ReAct (Reasoning + Acting) loop, with tool-use capabilities and persistent memory.
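The ReAct loop the agents run can be sketched as reason → act → observe with a tool registry and a memory scratchpad. Everything here is an illustrative assumption: a real agent would ask an LLM to choose the action, whereas this toy "reasoner" is a hard-coded rule.

```python
# Hypothetical tool registry; real agents would have many more tools.
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
    "echo": lambda text: text,
}

def react_loop(task, max_steps=3):
    memory = []  # persistent scratchpad of (thought, action, observation)
    for _ in range(max_steps):
        # Reason: decide which tool to use (rule-based stand-in for an LLM).
        if any(ch.isdigit() for ch in task):
            action, arg = "calculator", task
        else:
            action, arg = "echo", task
        # Act: invoke the chosen tool and record the observation.
        observation = TOOLS[action](arg)
        memory.append((f"use {action}", action, observation))
        # In this toy loop, one step always answers the task.
        return observation, memory

answer, trace = react_loop("2 + 3")
print(answer)
```

The persistent `memory` list is what lets an agent carry intermediate observations across steps; specializing the reasoning rules and tool sets per category is presumably how the Execution, Audit, Security, and Design roles differ.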
Layer 6: Extended Features
Built-in RAG (Retrieval-Augmented Generation) pipeline, visual routing, conversation branching, plugin system, and safety guardrails.
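The retrieval step at the heart of a RAG pipeline can be illustrated with a stdlib-only sketch: rank documents against the query, then splice the best hit into the prompt. A real pipeline would use embeddings and a vector store; the bag-of-words cosine here is a deliberate simplification.

```python
import math
from collections import Counter

def vectorize(text):
    # Bag-of-words term counts stand in for an embedding.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=1):
    # Rank documents by similarity to the query; keep the top k.
    qv = vectorize(query)
    ranked = sorted(docs, key=lambda d: cosine(qv, vectorize(d)), reverse=True)
    return ranked[:k]

DOCS = [
    "PagedKVCache manages KV memory in fixed-size pages",
    "The scheduler batches incoming requests",
]
context = retrieve("how is KV memory managed", DOCS)[0]
prompt = f"Context: {context}\nQuestion: how is KV memory managed"
print(context)
```

The retrieved context is prepended to the prompt before it enters the normal inference pipeline, which is what grounds the model's answer in the documents.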
Layer 7: Observability
Integrates Prometheus metrics, request tracing via a Tracer, cost estimation via a CostEstimator, and a SelfBenchmark self-test tool.
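A token-based cost estimator is simple enough to sketch directly. The price table and the model names below are made-up assumptions purely for illustration; they are not Pravāha's actual rates.

```python
# Assumed USD prices per 1,000 tokens -- illustrative only.
PRICES_PER_1K = {"small-model": 0.0005, "large-model": 0.03}

def estimate_cost(model, prompt_tokens, completion_tokens):
    """Estimate request cost from token counts and a per-model rate."""
    total = prompt_tokens + completion_tokens
    return total / 1000 * PRICES_PER_1K[model]

cost = estimate_cost("large-model", prompt_tokens=800, completion_tokens=200)
print(f"${cost:.4f}")  # 1000 tokens at $0.03/1K -> $0.0300
```

Exposing the same numbers as Prometheus counters would let a dashboard track spend per model alongside latency and throughput.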
Layer 8: Rust Performance Core
Key components such as BlockAllocator, PrefixTrie, and AllocatorStats are implemented in Rust, delivering near-native performance while retaining the convenience of Python development.