Zing Forum

Reading

STREAM: A Three-Tier LLM Inference Architecture Unifying Local, HPC, and Cloud Environments

STREAM achieves unified scheduling of local, high-performance computing (HPC) center, and commercial cloud API resources through intelligent hierarchical routing and a dual-channel HPC streaming architecture. While ensuring data privacy, it reduces the first-token latency of HPC inference from 11.4 seconds to 0.54 seconds.

LLM推理HPC分层架构流式传输成本优化隐私保护
Published 2026-06-12 07:20Recent activity 2026-06-15 09:18Estimated read 6 min
STREAM: A Three-Tier LLM Inference Architecture Unifying Local, HPC, and Cloud Environments
1

Section 01

[Introduction] STREAM: A Three-Tier LLM Inference Architecture Unifying Local, HPC, and Cloud Environments

STREAM is a three-tier architecture system addressing the resource fragmentation issue in LLM inference. It实现s unified scheduling of local, high-performance computing (HPC) center, and commercial cloud API resources via intelligent hierarchical routing and a dual-channel HPC streaming architecture. Its core value lies in reducing the first-token latency of HPC inference from 11.4 seconds to 0.54 seconds while ensuring data privacy, striking an optimal balance between cost, performance, and privacy.

2

Section 02

Background: The Dilemma of Fragmented LLM Inference Ecosystem

Current LLM users face a triple dilemma:

  • Local Deployment: Free and private, but hardware limitations prevent running large models or long contexts;
  • Institutional HPC: Strong resources with data retained within the institution, but designed for batch processing jobs rather than interactive use;
  • Commercial Cloud API: On-demand service but with high costs and privacy risks. The three types of resources each have their pros and cons, and there is no unified system allowing users to choose flexibly, forcing trade-offs between convenience, cost, and security.
3

Section 03

Core Architecture 1: Intelligent Three-Tier Routing and Complexity Judgment

The core of STREAM is the intelligent routing layer, integrating local, HPC, and cloud resources:

  • Equipped with a local lightweight LLM complexity judge that analyzes query complexity in milliseconds;
  • Simple queries → local, medium → HPC, complex → cloud;
  • Avoids one-size-fits-all strategies to achieve optimal resource allocation.
4

Section 04

Core Architecture 2: Dual-Channel HPC Streaming Architecture Breaks Firewall Limitations

To address HPC firewall issues, STREAM adopts a dual-channel design:

  • Control Plane: Globus Compute handles authentication and scheduling;
  • Data Plane: WebSocket relay transmits tokens without modifying network configurations;
  • Effect: First-token latency reduced from 11.4 seconds to 0.54 seconds (21.1x improvement), with end-to-end AES-256-GCM encryption ensuring privacy.
5

Section 05

Core Architecture 3: Context Awareness and HPC-as-API Mode

Solves the problem of resource waste in long conversations:

  • Context-Aware Level Retention: Intelligently compresses historical conversations to prevent simple queries from being moved to high-cost tiers;
  • HPC-as-API: Encapsulates HPC into an OpenAI-compatible API, allowing users to call it without professional HPC knowledge and breaking the latency limits of traditional batch processing.
6

Section 06

Performance Evaluation: 85%+ Retention Rate in Free Tier and Significant Latency Optimization

Benchmark test results (1200 queries across 10 domains):

  • When using the Llama3.2 3B local model, 85.1% of queries are completed in the free tier;
  • First-token latency comparison: Local (0.26s), HPC streaming (0.54s), commercial cloud API (1.68s);
  • HPC mode latency is better than cloud, benefiting from high-performance hardware and optimized paths.
7

Section 07

Practical Significance: Dual Reduction in Compliance and Cost, Democratizing HPC Resources

STREAM's value for academia and institutions:

  • Compliance: Sensitive data stays in institutional HPC without third-party cloud involvement;
  • Cost: 85% free queries save budget;
  • Education Scenarios: HPC-as-API lowers the barrier, allowing students and teachers to use HPC like ChatGPT;
  • Technical Paradigm: Demonstrates hybrid intelligent collaboration ideas, providing references for resource-constrained scenarios.
8

Section 08

Limitations and Future Directions

Current Limitations:

  • Training data and generalization ability of the complexity judge are not detailed;
  • WebSocket relay has single-point failure risk. Future Directions:
  • Introduce more tiers like edge computing;
  • Support tiered inference for multi-modal models;
  • Develop adaptive complexity threshold adjustment mechanisms.