Zing Forum

FIRST: Federated Inference Resource Scheduling Toolkit for Scientific Computing

FIRST (Federated Inference Resource Scheduling Toolkit) is an open-source inference gateway developed by Argonne National Laboratory. It provides secure and scalable large language model (LLM) inference services for scientific computing clusters via OpenAI-compatible APIs, supporting both batch and interactive modes.

Tags: Scientific Computing · Inference Gateway · HPC · Federated Learning · LLM Inference · vLLM · Globus · Private Deployment
Published 2026-04-02 03:44 · Recent activity 2026-04-02 03:56 · Estimated read: 9 min

Section 01

Introduction

FIRST (Federated Inference Resource Scheduling Toolkit) is an open-source inference gateway developed by Argonne National Laboratory. It aims to address the core challenge faced by research institutions: leveraging high-performance computing (HPC) infrastructure for large language model (LLM) inference while protecting data privacy. This toolkit provides secure and scalable inference services via OpenAI-compatible APIs, supporting both batch and interactive modes. It uses a federated architecture to enable cross-cluster resource scheduling, offering a private AI inference solution for the scientific computing domain.
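
Because the gateway is OpenAI-compatible, clients talk to it with the same request shape the OpenAI SDK produces, just pointed at the gateway's URL. A minimal sketch of such a request body; the model id here is a placeholder, not a documented FIRST model name:

```python
import json

# Request body for an OpenAI-compatible /v1/chat/completions endpoint.
# The model id below is a placeholder, not a documented FIRST model name.
payload = {
    "model": "meta-llama/Llama-3-70B",
    "messages": [
        {"role": "system", "content": "You are a scientific assistant."},
        {"role": "user", "content": "Summarize the key findings of this abstract."},
    ],
    "stream": True,  # interactive mode: stream tokens as they are generated
}
print(json.dumps(payload, indent=2))
```

With the official OpenAI SDK, the same payload would be sent by setting `base_url` to the gateway's address instead of OpenAI's.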


Section 02

Project Background and Positioning

With the widespread application of LLMs in scientific research, institutions face a tension between data security and resource utilization: commercial cloud APIs are convenient, but sending sensitive data to them is hard to justify, while in-house HPC resources often sit underused for inference. FIRST emerged as an open-source project offering an "inference-as-a-service" model, allowing researchers to run parallel inference workloads in a private, secure environment.


Section 03

Core Architecture and Key Features

Core Architecture

  • API Gateway Layer: Based on the Django framework, responsible for request validation, identity authentication (Globus Auth), permission control, and routing
  • Authentication and Authorization: Integrates Globus Auth, supporting institutional account login, SSO, and multi-factor authentication
  • Compute Execution Layer: Enables remote execution across distributed HPC clusters via Globus Compute, supporting resource elasticity and multi-model routing
  • Inference Backend: Mainly integrates vLLM, supports PagedAttention optimization, and the architecture is extensible to other engines
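
The layered flow above (validate, authenticate, route, execute) can be sketched as a pipeline of functions. The function bodies below are illustrative stand-ins, not FIRST's actual implementation:

```python
# Illustrative request flow through the gateway layers listed above.
# All function bodies are stand-ins, not FIRST's actual implementation.

def validate(request: dict) -> dict:
    """API gateway layer: reject malformed requests early."""
    if "model" not in request or "messages" not in request:
        raise ValueError("malformed request")
    return request

def check_auth(request: dict) -> dict:
    """Stand-in for Globus Auth token verification."""
    if request.get("token") != "valid-token":
        raise PermissionError("authentication failed")
    return request

def route_model(request: dict) -> str:
    """Stand-in for multi-model routing across HPC clusters."""
    routes = {"llama": "cluster-a", "mistral": "cluster-b"}
    return routes.get(request["model"], "cluster-a")

def handle(request: dict) -> str:
    """Full pipeline: validate -> authenticate -> route."""
    return route_model(check_auth(validate(request)))

print(handle({"model": "mistral", "messages": [], "token": "valid-token"}))  # -> cluster-b
```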

Key Features

  • OpenAI-compatible API: Seamless switching with existing SDKs, supporting interfaces like chat completions and embeddings
  • Dual-mode inference: Interactive mode (low latency, streaming output) and batch mode (high throughput, asynchronous processing)
  • Auto-scaling: Load-aware scheduling, instance pre-warming, and fault recovery
  • Multi-cluster federation: Cross-regional deployment, load balancing, and fault isolation
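
Load-aware scheduling with fault isolation across a federation reduces, at its core, to "pick the healthy cluster with the lightest load." A minimal sketch; the cluster names and queue-length metric are illustrative, not FIRST's actual scheduler:

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    queued_requests: int   # illustrative load metric
    healthy: bool = True

def pick_cluster(clusters: list) -> Cluster:
    """Load-aware scheduling: choose the healthy cluster with the lightest
    queue. Skipping unhealthy clusters gives basic fault isolation."""
    candidates = [c for c in clusters if c.healthy]
    if not candidates:
        raise RuntimeError("no healthy cluster available")
    return min(candidates, key=lambda c: c.queued_requests)

clusters = [
    Cluster("cluster-a", queued_requests=12),
    Cluster("cluster-b", queued_requests=3),
    Cluster("cluster-c", queued_requests=1, healthy=False),
]
print(pick_cluster(clusters).name)  # -> cluster-b
```
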

Section 04

Performance and Application Scenarios

Performance Data

  • Token generation: Billions of tokens per day
  • GPU utilization in batch mode: Over 90%
  • Average response time in interactive mode: Less than 1 second
  • Concurrent support: Hundreds of requests
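
As a sanity check on the headline figure, billions of tokens per day implies a sustained rate in the tens of thousands of tokens per second. A quick back-of-the-envelope calculation, taking 1 billion as an illustrative lower bound for "billions":

```python
# Back-of-the-envelope check on the headline throughput figure.
# "Billions of tokens per day" is taken as 1e9, an illustrative lower bound.
tokens_per_day = 1_000_000_000
seconds_per_day = 24 * 60 * 60            # 86,400 seconds
tokens_per_second = tokens_per_day / seconds_per_day
print(f"~{tokens_per_second:,.0f} tokens/s sustained")  # ~11,574 tokens/s
```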

Application Scenarios

  • Large-scale literature analysis: Extract key findings, generate reviews, and build knowledge graphs
  • Experimental data analysis: Process logs, extract structured information, and generate reports
  • Code generation assistance: Convert mathematical formulas to code, optimize parallelization, and generate documentation
  • Multimodal scientific data: Image annotation, cell feature extraction, and astronomical image analysis

Section 05

Security Compliance and Solution Comparison

Security and Compliance

  • Data privacy: Local execution, encrypted transmission, access auditing, and data isolation
  • Compliance support: GDPR-compliant, HIPAA-ready, and export control compliant

Solution Comparison

vs Commercial Cloud APIs

| Feature | FIRST | Commercial Cloud API |
| --- | --- | --- |
| Data privacy | Data never leaves the institution | Data uploaded to the cloud |
| Cost | Utilizes existing HPC resources | Pay-per-token |
| Customization | Fully controllable | Limited by service provider |
| Latency | Local network | Internet latency |

vs Self-Deployed vLLM

| Feature | FIRST | Direct vLLM Deployment |
| --- | --- | --- |
| Authentication and authorization | Enterprise-grade (Globus Auth) | Must be implemented independently |
| Multi-cluster | Natively supported | Requires additional development |
| Batch processing | Built-in support | Must be implemented independently |
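
The "must be implemented independently" rows are concrete: a bare vLLM server ships with no authentication layer, so a self-deployment needs at least a bearer-token check in front of it. A minimal stdlib-only sketch; the static token store is illustrative:

```python
import hmac

# Illustrative static token store; a real deployment would verify tokens
# against Globus Auth or another identity provider instead.
VALID_TOKENS = {"tok-alice-123": "alice", "tok-bob-456": "bob"}

def verify_token(authorization_header: str):
    """Return the user for a valid 'Bearer <token>' header, else None."""
    scheme, _, token = authorization_header.partition(" ")
    if scheme != "Bearer" or not token:
        return None
    for valid, user in VALID_TOKENS.items():
        # constant-time comparison avoids timing side channels
        if hmac.compare_digest(token, valid):
            return user
    return None

print(verify_token("Bearer tok-alice-123"))  # -> alice
print(verify_token("Bearer not-a-token"))    # -> None
```
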

Section 06

Deployment Options and Community Ecosystem

Deployment Options

  • Docker Deployment: Quick start for testing, command: docker pull auroragpt/first-gateway && docker run -p 8000:8000 auroragpt/first-gateway
  • Bare-metal Deployment: For production environments with high-performance requirements, deploy directly on HPC cluster login nodes
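
For repeatable test deployments, the docker run command above can be captured in a compose file. Only the image name and port come from the text; the restart policy and GPU reservation are assumptions for a GPU-backed setup:

```yaml
# Sketch of a compose file for a test deployment. The image and port are
# from the docs above; the GPU reservation and restart policy are assumptions.
services:
  first-gateway:
    image: auroragpt/first-gateway
    ports:
      - "8000:8000"
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```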

Community Ecosystem

  • Open-source license: Apache 2.0 (free for commercial use, modification, and distribution)
  • Academic citation: Can be cited in scientific papers (a BibTeX entry is provided)
  • Community contributions: Code enhancements, documentation improvements, use case sharing, and issue feedback

Section 07

Limitations, Countermeasures, and Future Directions

Limitations

  • Higher deployment complexity than cloud APIs
  • Requires GPU resources, which can be a heavy burden for small institutions
  • Community ecosystem is still evolving

Countermeasures

  • Managed services: Share infrastructure across institutions
  • Hybrid deployment: Use FIRST for sensitive data and cloud APIs for general queries
  • Gradual adoption: Start with a single node and scale out incrementally
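
The hybrid-deployment countermeasure reduces to a routing rule: sensitive requests stay on the on-premises gateway, everything else may go to a commercial API. A minimal sketch; both endpoint URLs are placeholders:

```python
# Placeholder endpoints: neither URL is a real FIRST or cloud address.
FIRST_URL = "https://first.internal.example.edu/v1"
CLOUD_URL = "https://api.example-cloud.com/v1"

def choose_endpoint(contains_sensitive_data: bool) -> str:
    """Hybrid deployment rule: sensitive workloads stay on-prem,
    general queries may use a commercial cloud API."""
    return FIRST_URL if contains_sensitive_data else CLOUD_URL

print(choose_endpoint(True))   # on-prem FIRST gateway
print(choose_endpoint(False))  # commercial cloud API
```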

Future Directions

  • Technical evolution: Integrate TensorRT-LLM/DeepSpeed, model version management, enhanced monitoring, edge deployment
  • Ecosystem development: Scientific model marketplace, Jupyter/RStudio integration, training resources

Section 08

Summary and Outlook

FIRST achieves deep integration of research infrastructure and AI technology, resolving the core tension between AI-driven productivity gains and data security. Through its federated architecture, enterprise-grade authentication, and HPC integration, it provides a private inference solution for scientific computing. As the community grows, FIRST is positioned to become an important component of AI infrastructure for research.