Zing Forum

FIRST: Federated Inference Resource Scheduling Toolkit for Scientific Computing

FIRST (Federated Inference Resource Scheduling Toolkit) is an open-source inference gateway developed by Argonne National Laboratory. It provides secure and scalable large language model (LLM) inference services for scientific computing clusters via OpenAI-compatible APIs, supporting both batch and interactive modes.

Tags: Scientific Computing · Inference Gateway · HPC · Federated Learning · LLM Inference · vLLM · Globus · Private Deployment
Published 2026-04-02 03:44 · Recent activity 2026-04-02 03:56 · Estimated read: 9 min

Section 01

Introduction

FIRST (Federated Inference Resource Scheduling Toolkit) is an open-source inference gateway developed by Argonne National Laboratory. It aims to address the core challenge faced by research institutions: leveraging high-performance computing (HPC) infrastructure for large language model (LLM) inference while protecting data privacy. This toolkit provides secure and scalable inference services via OpenAI-compatible APIs, supporting both batch and interactive modes. It uses a federated architecture to enable cross-cluster resource scheduling, offering a private AI inference solution for the scientific computing domain.
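
Because the gateway is OpenAI-compatible, clients talk to it with the same request shape the OpenAI SDK produces, just pointed at the gateway's URL. A minimal sketch of such a request body; the model id here is a placeholder, not a documented FIRST model name:

```python
import json

# Request body for an OpenAI-compatible /v1/chat/completions endpoint.
# The model id below is a placeholder, not a documented FIRST model name.
payload = {
    "model": "meta-llama/Llama-3-70B",
    "messages": [
        {"role": "system", "content": "You are a scientific assistant."},
        {"role": "user", "content": "Summarize the key findings of this abstract."},
    ],
    "stream": True,  # interactive mode: stream tokens as they are generated
}
print(json.dumps(payload, indent=2))
```

With the official OpenAI SDK, the same payload would be sent by setting `base_url` to the gateway's address instead of OpenAI's.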


Section 02

Project Background and Positioning

With the widespread application of LLMs in scientific research, institutions face a tension between data security and resource utilization: commercial cloud APIs are convenient, but sending sensitive data to them is hard to justify, while in-house HPC resources often sit underused for inference. FIRST emerged as an open-source project offering an "inference-as-a-service" model, allowing researchers to run parallel inference workloads in a private, secure environment.


Section 03

Core Architecture and Key Features

Core Architecture

  • API Gateway Layer: Based on the Django framework, responsible for request validation, identity authentication (Globus Auth), permission control, and routing
  • Authentication and Authorization: Integrates Globus Auth, supporting institutional account login, SSO, and multi-factor authentication
  • Compute Execution Layer: Enables remote execution across distributed HPC clusters via Globus Compute, supporting resource elasticity and multi-model routing
  • Inference Backend: Mainly integrates vLLM, supports PagedAttention optimization, and the architecture is extensible to other engines
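
The layered flow above (validate, authenticate, route, execute) can be sketched as a pipeline of functions. The function bodies below are illustrative stand-ins, not FIRST's actual implementation:

```python
# Illustrative request flow through the gateway layers listed above.
# All function bodies are stand-ins, not FIRST's actual implementation.

def validate(request: dict) -> dict:
    """API gateway layer: reject malformed requests early."""
    if "model" not in request or "messages" not in request:
        raise ValueError("malformed request")
    return request

def check_auth(request: dict) -> dict:
    """Stand-in for Globus Auth token verification."""
    if request.get("token") != "valid-token":
        raise PermissionError("authentication failed")
    return request

def route_model(request: dict) -> str:
    """Stand-in for multi-model routing across HPC clusters."""
    routes = {"llama": "cluster-a", "mistral": "cluster-b"}
    return routes.get(request["model"], "cluster-a")

def handle(request: dict) -> str:
    """Full pipeline: validate -> authenticate -> route."""
    return route_model(check_auth(validate(request)))

print(handle({"model": "mistral", "messages": [], "token": "valid-token"}))  # -> cluster-b
```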

Key Features

  • OpenAI-compatible API: Seamless switching with existing SDKs, supporting interfaces like chat completions and embeddings
  • Dual-mode inference: Interactive mode (low latency, streaming output) and batch mode (high throughput, asynchronous processing)
  • Auto-scaling: Load-aware scheduling, instance pre-warming, and fault recovery
  • Multi-cluster federation: Cross-regional deployment, load balancing, and fault isolation
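
Load-aware scheduling with fault isolation across a federation reduces, at its core, to "pick the healthy cluster with the lightest load." A minimal sketch; the cluster names and queue-length metric are illustrative, not FIRST's actual scheduler:

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    queued_requests: int   # illustrative load metric
    healthy: bool = True

def pick_cluster(clusters: list) -> Cluster:
    """Load-aware scheduling: choose the healthy cluster with the lightest
    queue. Skipping unhealthy clusters gives basic fault isolation."""
    candidates = [c for c in clusters if c.healthy]
    if not candidates:
        raise RuntimeError("no healthy cluster available")
    return min(candidates, key=lambda c: c.queued_requests)

clusters = [
    Cluster("cluster-a", queued_requests=12),
    Cluster("cluster-b", queued_requests=3),
    Cluster("cluster-c", queued_requests=1, healthy=False),
]
print(pick_cluster(clusters).name)  # -> cluster-b
```
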

Section 04

Performance and Application Scenarios

Performance Data

  • Token generation: Billions of tokens per day
  • GPU utilization in batch mode: Over 90%
  • Average response time in interactive mode: Less than 1 second
  • Concurrent support: Hundreds of requests
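
As a sanity check on the headline figure, billions of tokens per day implies a sustained rate in the tens of thousands of tokens per second. A quick back-of-the-envelope calculation, taking 1 billion as an illustrative lower bound for "billions":

```python
# Back-of-the-envelope check on the headline throughput figure.
# "Billions of tokens per day" is taken as 1e9, an illustrative lower bound.
tokens_per_day = 1_000_000_000
seconds_per_day = 24 * 60 * 60            # 86,400 seconds
tokens_per_second = tokens_per_day / seconds_per_day
print(f"~{tokens_per_second:,.0f} tokens/s sustained")  # ~11,574 tokens/s
```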

Application Scenarios

  • Large-scale literature analysis: Extract key findings, generate reviews, and build knowledge graphs
  • Experimental data analysis: Process logs, extract structured information, and generate reports
  • Code generation assistance: Convert mathematical formulas to code, optimize parallelization, and generate documentation
  • Multimodal scientific data: Image annotation, cell feature extraction, and astronomical image analysis

Section 05

Security Compliance and Solution Comparison

Security and Compliance

  • Data privacy: Local execution, encrypted transmission, access auditing, and data isolation
  • Compliance support: GDPR-compliant, HIPAA-ready, and export control compliant

Solution Comparison

vs Commercial Cloud APIs

| Feature | FIRST | Commercial Cloud API |
| --- | --- | --- |
| Data privacy | Data never leaves the institution | Data uploaded to the cloud |
| Cost | Utilizes existing HPC resources | Pay-per-token |
| Customization | Fully controllable | Limited by service provider |
| Latency | Local network | Internet latency |

vs Self-Deployed vLLM

| Feature | FIRST | Direct vLLM Deployment |
| --- | --- | --- |
| Authentication and authorization | Enterprise-grade (Globus Auth) | Must be implemented independently |
| Multi-cluster | Natively supported | Requires additional development |
| Batch processing | Built-in support | Must be implemented independently |
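
The "must be implemented independently" rows are concrete: a bare vLLM server ships with no authentication layer, so a self-deployment needs at least a bearer-token check in front of it. A minimal stdlib-only sketch; the static token store is illustrative:

```python
import hmac

# Illustrative static token store; a real deployment would verify tokens
# against Globus Auth or another identity provider instead.
VALID_TOKENS = {"tok-alice-123": "alice", "tok-bob-456": "bob"}

def verify_token(authorization_header: str):
    """Return the user for a valid 'Bearer <token>' header, else None."""
    scheme, _, token = authorization_header.partition(" ")
    if scheme != "Bearer" or not token:
        return None
    for valid, user in VALID_TOKENS.items():
        # constant-time comparison avoids timing side channels
        if hmac.compare_digest(token, valid):
            return user
    return None

print(verify_token("Bearer tok-alice-123"))  # -> alice
print(verify_token("Bearer not-a-token"))    # -> None
```
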

Section 06

Deployment Options and Community Ecosystem

Deployment Options

  • Docker Deployment: Quick start for testing, command: docker pull auroragpt/first-gateway && docker run -p 8000:8000 auroragpt/first-gateway
  • Bare-metal Deployment: For production environments with high-performance requirements, deploy directly on HPC cluster login nodes
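
For repeatable test deployments, the docker run command above can be captured in a compose file. Only the image name and port come from the text; the restart policy and GPU reservation are assumptions for a GPU-backed setup:

```yaml
# Sketch of a compose file for a test deployment. The image and port are
# from the docs above; the GPU reservation and restart policy are assumptions.
services:
  first-gateway:
    image: auroragpt/first-gateway
    ports:
      - "8000:8000"
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```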

Community Ecosystem

  • Open-source license: Apache 2.0 (free for commercial use, modification, and distribution)
  • Academic citation: Can be cited in scientific papers (a BibTeX entry is provided)
  • Community contributions: Code enhancements, documentation improvements, use case sharing, and issue feedback

Section 07

Limitations, Countermeasures, and Future Directions

Limitations

  • Higher deployment complexity than cloud APIs
  • Requires GPU resources, which can be a heavy burden for small institutions
  • Community ecosystem is still evolving

Countermeasures

  • Managed services: Share infrastructure across institutions
  • Hybrid deployment: Use FIRST for sensitive data and cloud APIs for general queries
  • Gradual adoption: Start with a single node and scale out incrementally
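
The hybrid-deployment countermeasure reduces to a routing rule: sensitive requests stay on the on-premises gateway, everything else may go to a commercial API. A minimal sketch; both endpoint URLs are placeholders:

```python
# Placeholder endpoints: neither URL is a real FIRST or cloud address.
FIRST_URL = "https://first.internal.example.edu/v1"
CLOUD_URL = "https://api.example-cloud.com/v1"

def choose_endpoint(contains_sensitive_data: bool) -> str:
    """Hybrid deployment rule: sensitive workloads stay on-prem,
    general queries may use a commercial cloud API."""
    return FIRST_URL if contains_sensitive_data else CLOUD_URL

print(choose_endpoint(True))   # on-prem FIRST gateway
print(choose_endpoint(False))  # commercial cloud API
```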

Future Directions

  • Technical evolution: Integrate TensorRT-LLM/DeepSpeed, model version management, enhanced monitoring, edge deployment
  • Ecosystem development: Scientific model marketplace, Jupyter/RStudio integration, training resources

Section 08

Summary and Outlook

FIRST achieves deep integration of research infrastructure and AI technology, resolving the core tension between AI-driven productivity gains and data security. Through its federated architecture, enterprise-grade authentication, and HPC integration, it provides a private inference solution for scientific computing. As the community grows, FIRST is positioned to become an important component of AI infrastructure for research.