# FIRST: Federated Inference Resource Scheduling Toolkit for Scientific Computing

> FIRST (Federated Inference Resource Scheduling Toolkit) is an open-source inference gateway developed by Argonne National Laboratory. It provides secure and scalable large language model (LLM) inference services for scientific computing clusters via OpenAI-compatible APIs, supporting both batch and interactive modes.

- 板块: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- 发布时间: 2026-04-01T19:44:32.000Z
- 最近活动: 2026-04-01T19:56:27.692Z
- 热度: 159.8
- 关键词: 科学计算, 推理网关, HPC, 联邦学习, LLM推理, vLLM, Globus, 私有化部署
- 页面链接: https://www.zingnex.cn/en/forum/thread/first
- Canonical: https://www.zingnex.cn/forum/thread/first
- Markdown 来源: floors_fallback

---

## FIRST: Federated Inference Resource Scheduling Toolkit for Scientific Computing (Introduction)

FIRST (Federated Inference Resource Scheduling Toolkit) is an open-source inference gateway developed by Argonne National Laboratory. It aims to address the core challenge faced by research institutions: leveraging high-performance computing (HPC) infrastructure for large language model (LLM) inference while protecting data privacy. This toolkit provides secure and scalable inference services via OpenAI-compatible APIs, supporting both batch and interactive modes. It uses a federated architecture to enable cross-cluster resource scheduling, offering a private AI inference solution for the scientific computing domain.

## Project Background and Positioning

With the widespread application of LLMs in scientific research, research institutions face a conflict between the risk of sensitive data leakage and the utilization of HPC resources: commercial cloud APIs are convenient but data security is hard to guarantee. FIRST emerged as an open-source project offering an "inference-as-a-service" model, allowing researchers to run parallel inference workloads in a private and secure environment.

## Core Architecture and Key Features

### Core Architecture
- **API Gateway Layer**: Based on the Django framework, responsible for request validation, identity authentication (Globus Auth), permission control, and routing
- **Authentication and Authorization**: Integrates Globus Auth, supporting institutional account login, SSO, and multi-factor authentication
- **Compute Execution Layer**: Enables remote execution across distributed HPC clusters via Globus Compute, supporting resource elasticity and multi-model routing
- **Inference Backend**: Mainly integrates vLLM, supports PagedAttention optimization, and the architecture is extensible to other engines

### Key Features
- OpenAI-compatible API: Seamless switching with existing SDKs, supporting interfaces like chat completions and embeddings
- Dual-mode inference: Interactive mode (low latency, streaming output) and batch mode (high throughput, asynchronous processing)
- Auto-scaling: Load-aware scheduling, preheating mechanism, and fault recovery
- Multi-cluster federation: Cross-regional deployment, load balancing, and fault isolation

## Performance and Application Scenarios

### Performance Data
- Daily token generation: Billions of tokens per day
- GPU utilization in batch mode: Over 90%
- Average response time in interactive mode: Less than 1 second
- Concurrent support: Hundreds of requests

### Application Scenarios
- Large-scale literature analysis: Extract key findings, generate reviews, and build knowledge graphs
- Experimental data analysis: Process logs, extract structured information, and generate reports
- Code generation assistance: Convert mathematical formulas to code, optimize parallelization, and generate documentation
- Multimodal scientific data: Image annotation, cell feature extraction, and astronomical image analysis

## Security Compliance and Solution Comparison

### Security and Compliance
- Data privacy: Local execution, encrypted transmission, access auditing, and data isolation
- Compliance support: GDPR-compliant, HIPAA-ready, and export control compliant

### Solution Comparison
#### vs Commercial Cloud APIs
| Feature | FIRST | Commercial Cloud API |
|---|---|---|
| Data privacy | Data never leaves the institution | Data uploaded to the cloud |
| Cost | Utilizes existing HPC resources | Pay-per-token |
| Customization | Fully controllable | Limited by service provider |
| Latency | Local network | Internet latency |

#### vs Self-Deployed vLLM
| Feature | FIRST | Direct vLLM Deployment |
|---|---|---|
| Authentication and Authorization | Enterprise-grade | Need to implement independently |
| Multi-cluster | Natively supported | Requires additional development |
| Batch processing | Built-in support | Need to implement independently |

## Deployment Options and Community Ecosystem

### Deployment Options
- **Docker Deployment**: Quick start for testing, command: `docker pull auroragpt/first-gateway && docker run -p 8000:8000 auroragpt/first-gateway`
- **Bare-metal Deployment**: For production environments with high-performance requirements, deploy directly on HPC cluster login nodes

### Community Ecosystem
- Open-source license: Apache 2.0 (free for commercial use, modification, and distribution)
- Academic citation: Supports citation in scientific papers (bibtex format available in the original text)
- Community contributions: Code enhancements, documentation improvements, use case sharing, and issue feedback

## Limitations, Countermeasures, and Future Directions

### Limitations
- Higher deployment complexity than cloud APIs
- Requires GPU resources, which is a heavy burden for small institutions
- Community ecosystem is still evolving

### Countermeasures
- Managed services: Shared infrastructure
- Hybrid deployment: Use FIRST for sensitive data, cloud APIs for general queries
- Gradual adoption: Expand from single node

### Future Directions
- Technical evolution: Integrate TensorRT-LLM/DeepSpeed, model version management, enhanced monitoring, edge deployment
- Ecosystem development: Scientific model marketplace, Jupyter/RStudio integration, training resources

## Summary and Outlook

FIRST achieves deep integration of scientific research infrastructure and AI technology, resolving the core conflict between "AI efficiency improvement" and "data security protection". Through its federated architecture, enterprise-grade security authentication, and HPC integration, it provides a private inference solution for scientific computing. As the community grows, FIRST is expected to become an important component of AI infrastructure for scientific research.