Zing Forum

Hearth: A Declarative Large Model Inference Service Framework on Kubernetes

An introduction to the open-source Hearth project: how to run declarative large language model (LLM) inference services on Kubernetes that scale to zero automatically, and where cloud-native AI infrastructure is heading.

Kubernetes, large language models, inference services, Scale-to-Zero, cloud native, LLM, autoscaling, Operator
Published 2026-06-08 19:45 | Recent activity 2026-06-08 19:58 | Estimated read 7 min

Section 01

Introduction: Hearth—A Declarative Large Model Inference Service Framework on Kubernetes

This article introduces the open-source Hearth project and discusses how to run declarative LLM inference services on Kubernetes that scale to zero automatically. It covers the resource-cost and operational challenges of LLM inference and the direction in which cloud-native AI infrastructure is evolving. The key points are declarative configuration to simplify operations, Scale-to-Zero to cut costs, and a vendor-neutral design to avoid lock-in.

Section 02

Infrastructure Challenges of LLM Inference

As large language models see wider use, inference services face highly variable request loads, strict latency requirements, and high GPU costs. Always-on services waste resources during low-traffic periods, while manual scaling struggles to keep up with peaks. Kubernetes is the cloud-native foundation, but LLM inference has characteristics (long model loading times, large memory footprints, and stateful requests) that make its general-purpose solutions hard to apply directly, so specialized tooling is needed.

Section 03

Core Concepts of Hearth: Declarative and Scale-to-Zero

Declarative Configuration: Through Kubernetes Custom Resource Definitions (CRDs), users only need to describe the target state of the model service (e.g., model source, resource requirements). Hearth handles underlying deployment, scaling, and other logic to simplify operations.
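
To make the declarative idea concrete, here is a minimal sketch of what such a target-state description might look like, rendered as a Kubernetes-style custom resource. The API group `hearth.example.com`, the kind `ModelService`, and every field name are assumptions made for illustration; the article does not show Hearth's actual schema. Only the `v1alpha1` version string and the `nvidia.com/gpu` resource name come from the source and from standard Kubernetes usage.

```python
from dataclasses import dataclass, field


@dataclass
class ScalingPolicy:
    min_replicas: int = 0            # 0 permits scale-to-zero
    max_replicas: int = 1
    idle_timeout_seconds: int = 300  # hypothetical field


@dataclass
class ModelServiceSpec:
    model_source: str                # e.g. a model repository ID or storage URI
    engine: str                      # e.g. "vllm"
    gpu_count: int = 1
    scaling: ScalingPolicy = field(default_factory=ScalingPolicy)


def to_manifest(name: str, spec: ModelServiceSpec) -> dict:
    """Render the spec as a custom resource dict (hypothetical schema)."""
    if spec.scaling.min_replicas > spec.scaling.max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    return {
        "apiVersion": "hearth.example.com/v1alpha1",  # assumed API group
        "kind": "ModelService",                       # assumed kind
        "metadata": {"name": name},
        "spec": {
            "modelSource": spec.model_source,
            "engine": spec.engine,
            "resources": {"limits": {"nvidia.com/gpu": spec.gpu_count}},
            "scaling": {
                "minReplicas": spec.scaling.min_replicas,
                "maxReplicas": spec.scaling.max_replicas,
                "idleTimeoutSeconds": spec.scaling.idle_timeout_seconds,
            },
        },
    }
```

The user states only what the service should look like; nothing in the spec says how to create pods or when to scale, which is the part an operator would own.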

Scale-to-Zero: When there are no requests, the service scales down to zero replicas and releases its GPU resources; a new request triggers a rapid scale-up. Cold starts add latency, but the cost savings are significant in non-real-time scenarios such as asynchronous batch processing and development and testing.
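
The scale-to-zero rule can be sketched as a small decision function. This is an illustrative idle-timeout policy, not Hearth's documented algorithm; the function name, its parameters, and the 300-second default are all assumptions.

```python
def desired_replicas(now: float, last_request_at: float, in_flight: int,
                     current: int, idle_timeout: float = 300.0) -> int:
    """Return the replica count an idle-timeout scale-to-zero policy asks for.

    Times are seconds on a shared clock; all names here are hypothetical.
    """
    if in_flight > 0:
        # Work is pending: make sure at least one replica exists.
        return max(current, 1)
    if current > 0 and now - last_request_at >= idle_timeout:
        # Idle for long enough: release the GPU.
        return 0
    # Otherwise keep whatever is running (including zero).
    return current
```

For example, with one replica running and the last request 400 seconds ago, the function returns 0; a single in-flight request arriving at a scaled-down service returns 1.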

Section 04

Architecture Design and Technology Selection

Hearth adopts the Kubernetes Operator pattern:

  • CRD and API Design: The api/v1alpha1 directory defines custom resources, supporting configurations such as model source, inference engine (vLLM/TensorRT-LLM, etc.), resource requirements, and scaling policies.
  • Controller Implementation: The controller code in the internal directory watches for resource changes and reconciles the actual state with the desired state, including configuration parsing, K8s resource creation, and scaling rule configuration.
  • Helm Chart Deployment: charts/hearth provides a Helm Chart to simplify installation, including RBAC permissions, Webhook configurations, etc.
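
The reconcile step at the heart of the Operator pattern can be sketched as follows. A real operator would be written against the Kubernetes API; this standalone sketch only models the comparison between desired and actual state, and the `State` type and action strings are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class State:
    engine: str
    replicas: int


def reconcile(desired: State, actual: Optional[State]) -> List[str]:
    """Return the actions needed to move `actual` toward `desired`."""
    if actual is None:
        # Nothing exists yet: create it in the desired shape.
        return [f"create deployment engine={desired.engine} replicas={desired.replicas}"]
    actions: List[str] = []
    if actual.engine != desired.engine:
        actions.append(f"update engine {actual.engine} -> {desired.engine}")
    if actual.replicas != desired.replicas:
        actions.append(f"scale {actual.replicas} -> {desired.replicas}")
    return actions  # an empty list means the states already match
```

Because the function depends only on the two states, running it again after the actions have been applied yields an empty list, which is the idempotence that reconcile loops rely on.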

Section 05

Vendor-Neutral Design Philosophy

Hearth emphasizes vendor neutrality to avoid lock-in:

  • Model Format Neutrality: Supports Hugging Face Transformers, GGUF, ONNX, etc.
  • Inference Engine Neutrality: Can switch between vLLM, TensorRT-LLM, TGI, etc.
  • Infrastructure Neutrality: Built on standard K8s APIs, so it can run on public clouds, private clouds, or edge environments.
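
One way to picture inference engine neutrality is a small adapter table that maps an engine name to the command used to start it, so that switching engines changes one field in the spec. This is an illustrative sketch rather than Hearth's implementation: the table and function names are hypothetical, only two engines are shown, and the command lines are the bare launch forms of each engine's CLI, which a real deployment would extend with further options.

```python
from typing import Callable, Dict, List

# Each adapter turns a model reference into a container command line.
# Illustrative only: real deployments would add ports, dtype options, etc.
ENGINE_COMMANDS: Dict[str, Callable[[str], List[str]]] = {
    "vllm": lambda model: ["vllm", "serve", model],
    "tgi": lambda model: ["text-generation-launcher", "--model-id", model],
}


def container_command(engine: str, model: str) -> List[str]:
    """Look up the launch command for `engine`, or fail for unknown engines."""
    try:
        return ENGINE_COMMANDS[engine](model)
    except KeyError:
        raise ValueError(f"unsupported engine: {engine!r}") from None
```

Adding another engine means adding one entry to the table; callers that only know the engine name and model reference are unaffected.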

Section 06

Technical Challenges of Scale-to-Zero

Implementing Scale-to-Zero requires solving:

  • Cold Start Latency: Mitigated via model caching, layered loading, preloading daemons, and request queuing/batch processing.
  • Request Routing: A proxy layer, such as the one Knative Serving provides, receives requests while no replica is running and triggers scale-up.
  • State Management: Designing state persistence strategies to ensure recovery of context such as conversation history and KV cache after scaling.
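
The request routing and queuing ideas above can be sketched as a small buffering proxy: while the backend is scaled to zero it holds incoming requests, asks for a scale-up once, and replays the queue when the backend reports ready. The `Activator` class and its methods are hypothetical names for illustration; this is a single-threaded model, not the Knative or Hearth implementation.

```python
from collections import deque
from typing import Callable, Deque, List, Optional


class Activator:
    """Buffers requests while the backend is scaled to zero."""

    def __init__(self, scale_up: Callable[[], None], max_queue: int = 100):
        self._scale_up = scale_up
        self._queue: Deque[str] = deque()
        self._max_queue = max_queue
        self._ready = False
        self._scaling = False

    def handle(self, request: str, backend: Callable[[str], str]) -> Optional[str]:
        """Forward the request if the backend is up, otherwise queue it."""
        if self._ready:
            return backend(request)
        if len(self._queue) >= self._max_queue:
            raise RuntimeError("queue full")
        self._queue.append(request)
        if not self._scaling:
            # Only the first queued request triggers a scale-up.
            self._scaling = True
            self._scale_up()
        return None

    def on_backend_ready(self, backend: Callable[[str], str]) -> List[str]:
        """Mark the backend ready and replay queued requests in arrival order."""
        self._ready = True
        self._scaling = False
        responses = []
        while self._queue:
            responses.append(backend(self._queue.popleft()))
        return responses
```

The bounded queue is a deliberate choice: during a long cold start, rejecting excess requests early is usually preferable to letting memory grow without limit.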

Section 07

Applicable Scenarios and Limitations

Applicable Scenarios: Development and testing environments (reducing resource costs), low-frequency batch processing tasks (task-triggered scaling), multi-tenant services (on-demand resource allocation).

Limitations: High-concurrency, low-latency production services still require persistent instances. Hearth supports multiple deployment modes for users to choose from.
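
The trade-off between the two modes can be estimated with simple arithmetic. The sketch below uses a deliberately simplified billing model in which each wake-up costs one cold start plus one idle-timeout window on top of the busy time; the function, its parameters, and the numbers in the usage note are hypothetical and not measurements from Hearth.

```python
def monthly_gpu_hours(busy_hours_per_day: float, wakeups_per_day: int,
                      idle_timeout_minutes: float, cold_start_minutes: float,
                      days: int = 30) -> float:
    """GPU-hours consumed per month under scale-to-zero (simplified model)."""
    # Each wake-up pays for a cold start and for the idle window before shutdown.
    overhead_hours = wakeups_per_day * (idle_timeout_minutes + cold_start_minutes) / 60
    # A GPU cannot run for more than 24 hours in a day.
    per_day = min(24.0, busy_hours_per_day + overhead_hours)
    return per_day * days


ALWAYS_ON_HOURS = 24 * 30  # one GPU kept running for a 30-day month
```

With 2 busy hours per day, 4 wake-ups, a 10-minute idle timeout, and a 5-minute cold start, the model gives 90 GPU-hours against 720 for an always-on instance. As the busy time approaches the full day the result converges to the always-on figure, which matches the point that high-traffic services gain little from scaling to zero.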

Section 08

Open-Source Significance and Community Value

The value of Hearth as an open-source project:

  1. Provides production-grade reference implementations, offering a starting point and benchmark for teams to evaluate technical solutions.
  2. The open-source model aggregates community best practices to form comprehensive solutions.
  3. Represents the cloud-native AI direction, treating AI workloads as first-class citizens to enhance automation, observability, and portability.