
Hearth: A Declarative Large Model Inference Service Framework on Kubernetes

An introduction to the open-source Hearth project: how to implement declarative large language model (LLM) inference services on Kubernetes that automatically scale to zero, and where cloud-native AI infrastructure is heading.

Kubernetes, Large Language Model, Inference Service, Scale-to-Zero, Cloud Native, LLM, Autoscaling, Operator

Published 2026-06-08 19:45 | Recent activity 2026-06-08 19:58 | Estimated read 7 min

Section 01

Introduction: Hearth—A Declarative Large Model Inference Service Framework on Kubernetes

This article introduces the open-source Hearth project and discusses how to run declarative large language model (LLM) inference services on Kubernetes that automatically scale to zero. It covers the resource-cost and operational challenges of LLM inference and the direction in which cloud-native AI infrastructure is evolving. The key ideas are declarative configuration to simplify operations, Scale-to-Zero to reduce cost, and a vendor-neutral design to avoid lock-in.

Section 02

Infrastructure Challenges of LLM Inference

As large language models see wider use, inference services face highly variable request loads, strict latency requirements, and high GPU costs. Always-on services waste resources during low-traffic periods, and manual scaling struggles to keep up with peaks. Kubernetes is the standard cloud-native foundation, but LLM inference has characteristics (long model loading times, large memory footprints, and stateful requests) that make its general-purpose mechanisms hard to apply directly, so specialized tooling is needed.

Section 03

Core Concepts of Hearth: Declarative and Scale-to-Zero

Declarative Configuration: Through Kubernetes Custom Resource Definitions (CRDs), users describe only the target state of the model service (e.g., model source, resource requirements). Hearth handles the underlying deployment, scaling, and related logic, which simplifies operations.
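To make the declarative idea concrete, here is a minimal sketch of what a desired-state description might look like. The field names, the `ModelService` kind, and the `hearth.example.com` API group are all hypothetical and are not taken from the Hearth project; only the `v1alpha1` version label comes from the article.

```python
from dataclasses import dataclass, asdict


@dataclass
class ModelServiceSpec:
    """Hypothetical desired-state description; field names are illustrative."""
    model_source: str        # e.g. a model repository ID or storage URI
    engine: str = "vllm"     # inference engine used to serve the model
    gpus: int = 1            # GPUs requested per replica
    min_replicas: int = 0    # 0 means the service may scale to zero
    max_replicas: int = 1


def to_manifest(name: str, spec: ModelServiceSpec) -> dict:
    """Render the spec as a Kubernetes-style custom resource dictionary."""
    if spec.min_replicas < 0 or spec.max_replicas < max(spec.min_replicas, 1):
        raise ValueError("replica bounds are inconsistent")
    return {
        "apiVersion": "hearth.example.com/v1alpha1",  # hypothetical API group
        "kind": "ModelService",                        # hypothetical kind
        "metadata": {"name": name},
        "spec": asdict(spec),
    }
```

The user states only what should exist; nothing in the description says how to create pods or configure scaling, which is the part an operator would take over.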

Scale-to-Zero: When there are no requests, the service scales down to zero replicas and releases its GPU resources; a new request triggers a rapid scale-up. The cold start adds latency, but the cost savings are significant in scenarios that are not latency-sensitive, such as asynchronous batch processing and development and testing.
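The scaling decision described here can be sketched as a small pure function. This is an illustration under assumed inputs (an in-flight request count and an idle timer), not Hearth's actual policy; the 300-second default timeout is an arbitrary placeholder.

```python
def desired_replicas(in_flight: int, idle_seconds: float,
                     current: int, idle_timeout: float = 300.0) -> int:
    """Return the replica count a simple scale-to-zero policy would request."""
    if in_flight > 0:
        return max(current, 1)   # traffic present: ensure at least one replica
    if idle_seconds >= idle_timeout:
        return 0                 # idle long enough: release the GPU
    return current               # idle, but still inside the grace period
```

For example, `desired_replicas(0, 600, 1)` asks for zero replicas, while `desired_replicas(2, 0, 0)` asks for one because requests are waiting. The grace period keeps a briefly idle service from paying the cold-start cost again on the next request.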

Section 04

Architecture Design and Technology Selection

Hearth adopts the Kubernetes Operator pattern:

CRD and API Design: The api/v1alpha1 directory defines custom resources, supporting configurations such as model source, inference engine (vLLM/TensorRT-LLM, etc.), resource requirements, and scaling policies.
Controller Implementation: The controller code in the internal directory watches for resource changes and reconciles the actual state with the desired state, which includes configuration parsing, K8s resource creation, and scaling rule configuration.
Helm Chart Deployment: charts/hearth provides a Helm Chart to simplify installation, including RBAC permissions, Webhook configurations, etc.
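The reconciliation step at the heart of the Operator pattern can be sketched as a comparison of desired and actual state that yields the actions needed to converge. This is a simplified illustration: the `State` type and the action strings are hypothetical, and a real controller would call the Kubernetes API rather than return text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class State:
    """Hypothetical, simplified view of a model-serving deployment."""
    image: str
    replicas: int


def reconcile(desired: State, actual: Optional[State]) -> List[str]:
    """Return the actions that would move the actual state to the desired one."""
    if actual is None:
        # Nothing exists yet, so the only action is to create it.
        return [f"create deployment image={desired.image} replicas={desired.replicas}"]
    actions = []
    if actual.image != desired.image:
        actions.append(f"update image {actual.image} -> {desired.image}")
    if actual.replicas != desired.replicas:
        actions.append(f"scale {actual.replicas} -> {desired.replicas}")
    return actions
```

When the two states already match, the function returns an empty list, which is what makes reconciliation safe to run repeatedly.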

Section 05

Vendor-Neutral Design Philosophy

Hearth emphasizes vendor neutrality to avoid lock-in:

Model Format Neutrality: Supports Hugging Face Transformers, GGUF, ONNX, etc.
Inference Engine Neutrality: Can switch between vLLM, TensorRT-LLM, TGI, etc.
Infrastructure Neutrality: Based on standard K8s APIs, can run on public clouds, private clouds, or edge environments.
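One way to picture inference engine neutrality is a lookup table that maps an engine name to the command that starts its server, so switching engines changes one field in the configuration. This is a sketch, not Hearth's mechanism: the table and function names are hypothetical, and the command lines are illustrative and should be checked against each engine's own documentation.

```python
from typing import Callable, Dict, List

# Hypothetical registry: engine name -> function building the container command.
ENGINE_COMMANDS: Dict[str, Callable[[str], List[str]]] = {
    "vllm": lambda model: ["vllm", "serve", model],
    "tgi": lambda model: ["text-generation-launcher", "--model-id", model],
}


def container_command(engine: str, model: str) -> List[str]:
    """Build the server start command for the chosen engine."""
    try:
        return ENGINE_COMMANDS[engine](model)
    except KeyError:
        raise ValueError(f"unsupported engine: {engine}") from None
```

Supporting another engine then means adding one entry to the registry, while the rest of the deployment logic stays unchanged.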

Section 06

Technical Challenges of Scale-to-Zero

Implementing Scale-to-Zero requires solving:

Cold Start Latency: Mitigated via model caching, layered loading, preloading daemons, and request queuing/batch processing.
Request Routing: A proxy layer, such as the one Knative Serving provides, receives requests while no replica is running and triggers scaling.
State Management: State persistence strategies are needed so that context such as conversation history and KV cache can be recovered after scaling.
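The request routing and queuing ideas above can be sketched as a small buffering proxy: while no backend is running, requests are held and a single scale-up is requested; once the backend reports ready, the queue is flushed in arrival order. The `Activator` class and its method names are hypothetical, and the sketch is synchronous and single-threaded for clarity.

```python
from collections import deque


class Activator:
    """Hypothetical proxy that buffers requests while the backend is at zero."""

    def __init__(self, scale_up):
        self._scale_up = scale_up   # callback that requests a replica
        self._pending = deque()     # requests waiting for a backend
        self._ready = False
        self._scaling = False

    def handle(self, request, backend):
        if self._ready:
            return backend(request)     # warm path: forward immediately
        self._pending.append(request)   # cold path: hold the request
        if not self._scaling:
            self._scaling = True
            self._scale_up()            # trigger scaling only once
        return None                     # response is produced on flush

    def on_backend_ready(self, backend):
        """Flush buffered requests in arrival order once a replica is up."""
        self._ready = True
        self._scaling = False
        responses = []
        while self._pending:
            responses.append(backend(self._pending.popleft()))
        return responses
```

Requests that arrive during the cold start are delayed rather than rejected, and several of them share a single scale-up.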

Section 07

Applicable Scenarios and Limitations

Applicable Scenarios: Development and testing environments (reducing resource costs), low-frequency batch processing tasks (task-triggered scaling), multi-tenant services (on-demand resource allocation).

Limitations: High-concurrency, low-latency production services still require persistent instances. Hearth supports multiple deployment modes for users to choose from.
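The cost side of this trade-off can be estimated with simple arithmetic. The sketch below assumes billing is proportional to GPU hours and ignores cold-start time and storage; the function name and every number passed to it are hypothetical.

```python
def monthly_gpu_cost(hourly_rate: float, active_hours_per_day: float,
                     scale_to_zero: bool, days: int = 30) -> float:
    """Estimate monthly GPU cost for an always-on or a scale-to-zero service."""
    # An always-on instance is billed for all 24 hours regardless of traffic.
    hours = active_hours_per_day if scale_to_zero else 24.0
    return hourly_rate * hours * days
```

With a hypothetical rate of 2.0 per GPU hour and 3 active hours per day, the estimate is 180.0 with Scale-to-Zero and 1440.0 for a persistent instance. The gap narrows as active hours approach 24, which is why always-busy production services gain little from scaling to zero.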

Section 08

Open-Source Significance and Community Value

The value of open-sourcing Hearth:

Provides production-grade reference implementations, offering a starting point and benchmark for teams to evaluate technical solutions.
The open-source model aggregates community best practices to form comprehensive solutions.
Represents the cloud-native AI direction, treating AI workloads as first-class citizens to enhance automation, observability, and portability.
