# Hearth: A Declarative Large Model Inference Service Framework on Kubernetes

> An introduction to the open-source Hearth project: how to implement declarative large language model (LLM) inference services on Kubernetes that automatically scale to zero, and where cloud-native AI infrastructure is heading.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-08T11:45:56.000Z
- Last activity: 2026-06-08T11:58:16.699Z
- Popularity: 159.8
- Keywords: Kubernetes, large language model, inference service, Scale-to-Zero, cloud native, LLM, autoscaling, Operator
- Page link: https://www.zingnex.cn/en/forum/thread/hearth-kubernetes
- Canonical: https://www.zingnex.cn/forum/thread/hearth-kubernetes

---

## Introduction: Hearth—A Declarative Large Model Inference Service Framework on Kubernetes

This article introduces the open-source Hearth project and discusses how to run declarative large language model (LLM) inference services on Kubernetes that automatically scale to zero. Hearth targets the resource cost and operational burden of LLM inference, and the article also looks at where cloud-native AI infrastructure is heading. Three design points stand out: declarative configuration to simplify operations, Scale-to-Zero to reduce costs, and a vendor-neutral design to avoid lock-in.

## Infrastructure Challenges of LLM Inference

As large language models see wider use, inference services face highly variable request loads, strict latency requirements, and high GPU costs. Always-on services waste resources during low-traffic periods, and manual scaling struggles to keep up with peak loads. Kubernetes provides the cloud-native foundation, but LLM inference has characteristics (long model loading times, large memory footprints, and stateful requests) that make its general-purpose mechanisms hard to apply directly, so specialized tooling is needed.

## Core Concepts of Hearth: Declarative and Scale-to-Zero

**Declarative Configuration**: Through Kubernetes Custom Resource Definitions (CRDs), users only need to describe the target state of the model service (e.g., model source, resource requirements). Hearth handles underlying deployment, scaling, and other logic to simplify operations.
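To make the idea of "describing the target state" concrete, here is a minimal sketch of what such a desired-state object could look like. The field names, the `hearth.example/v1alpha1` API group, and the `ModelService` kind are all hypothetical and chosen only for illustration; they are not taken from Hearth's actual CRD schema.

```python
from dataclasses import dataclass, field


@dataclass
class ScalingPolicy:
    min_replicas: int = 0            # 0 means the service may scale to zero
    max_replicas: int = 1
    idle_timeout_seconds: int = 300


@dataclass
class ModelServiceSpec:
    """Hypothetical desired state; NOT Hearth's real schema."""
    model_source: str                # e.g. a model hub ID or a storage URI
    engine: str = "vllm"
    gpu_count: int = 1
    scaling: ScalingPolicy = field(default_factory=ScalingPolicy)

    def validate(self) -> list:
        """Return a list of human-readable problems (empty if the spec is valid)."""
        errors = []
        if not self.model_source:
            errors.append("model_source must not be empty")
        if self.gpu_count < 0:
            errors.append("gpu_count must be >= 0")
        if self.scaling.min_replicas < 0:
            errors.append("min_replicas must be >= 0")
        if self.scaling.max_replicas < max(1, self.scaling.min_replicas):
            errors.append("max_replicas must be >= 1 and >= min_replicas")
        return errors


def to_manifest(name: str, spec: ModelServiceSpec) -> dict:
    """Render the spec as a Kubernetes-style manifest dictionary."""
    return {
        "apiVersion": "hearth.example/v1alpha1",  # hypothetical API group
        "kind": "ModelService",                   # hypothetical kind
        "metadata": {"name": name},
        "spec": {
            "modelSource": spec.model_source,
            "engine": spec.engine,
            "resources": {"gpu": spec.gpu_count},
            "scaling": {
                "minReplicas": spec.scaling.min_replicas,
                "maxReplicas": spec.scaling.max_replicas,
                "idleTimeoutSeconds": spec.scaling.idle_timeout_seconds,
            },
        },
    }


if __name__ == "__main__":
    demo = ModelServiceSpec(model_source="org/model")
    print(demo.validate())
    print(to_manifest("demo", demo))
```

The user states only what they want (which model, which engine, how many GPUs, scaling bounds); how that state is reached is left to the controller.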

**Scale-to-Zero**: When there are no requests, the service scales down to zero and releases its GPU resources; a new request triggers a rapid scale-up. Cold starts add latency, but the cost savings are significant in scenarios that are not latency-sensitive, such as asynchronous batch processing and development or testing.
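The scaling decision described here can be sketched as a small pure function. This is an illustrative idle-timeout rule under assumed parameter names (`idle_timeout`, `max_replicas`), not Hearth's actual scaling logic.

```python
def desired_replicas(
    current: int,
    in_flight: int,
    seconds_since_last_request: float,
    idle_timeout: float = 300.0,   # assumed default, for illustration only
    max_replicas: int = 1,
) -> int:
    """Decide how many replicas should run, using a simple idle-timeout rule."""
    if in_flight > 0:
        # Traffic is present: make sure at least one replica exists.
        return min(max(current, 1), max_replicas)
    if seconds_since_last_request >= idle_timeout:
        # Idle for long enough: release the GPU entirely.
        return 0
    # Idle, but not for long enough to justify a future cold start.
    return current


if __name__ == "__main__":
    print(desired_replicas(current=0, in_flight=1, seconds_since_last_request=0.0))
    print(desired_replicas(current=1, in_flight=0, seconds_since_last_request=301.0))
```

The idle timeout is the main tuning knob: a short timeout saves more GPU time but causes more cold starts, while a long one does the opposite.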

## Architecture Design and Technology Selection

Hearth adopts the Kubernetes Operator pattern:
- **CRD and API Design**: The `api/v1alpha1` directory defines custom resources, supporting configurations such as model source, inference engine (vLLM/TensorRT-LLM, etc.), resource requirements, and scaling policies.
- **Controller Implementation**: The controller code in the `internal` directory watches for resource changes and reconciles the actual state with the desired state, which covers configuration parsing, Kubernetes resource creation, and scaling rule configuration.
- **Helm Chart Deployment**: `charts/hearth` provides a Helm Chart to simplify installation, including RBAC permissions, Webhook configurations, etc.
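The reconciliation step at the heart of the Operator pattern can be illustrated without any Kubernetes client. The sketch below compares a desired set of resources with an observed set and derives the actions needed to converge. Resource names and the dictionary layout are made up for the example, and a real controller would call the Kubernetes API instead of mutating a dictionary.

```python
def reconcile(desired: dict, observed: dict) -> list:
    """Return (verb, name) actions that move `observed` toward `desired`."""
    actions = []
    for name, spec in desired.items():
        if name not in observed:
            actions.append(("create", name))
        elif observed[name] != spec:
            actions.append(("update", name))
    for name in observed:
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply_actions(actions: list, desired: dict, observed: dict) -> None:
    """Apply the actions to the in-memory `observed` state (stand-in for API calls)."""
    for verb, name in actions:
        if verb == "delete":
            del observed[name]
        else:
            observed[name] = dict(desired[name])


if __name__ == "__main__":
    want = {"deployment": {"replicas": 1}, "service": {"port": 8000}}
    have = {"deployment": {"replicas": 0}, "stale-config": {}}
    plan = reconcile(want, have)
    print(plan)
    apply_actions(plan, want, have)
    print(reconcile(want, have))  # nothing left to do once the states match
```

Reconciliation is idempotent: once the observed state matches the desired state, running it again yields no actions, which is what lets a controller safely retry after failures.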

## Vendor-Neutral Design Philosophy

Hearth emphasizes vendor neutrality to avoid lock-in:
- **Model Format Neutrality**: Supports Hugging Face Transformers, GGUF, ONNX, etc.
- **Inference Engine Neutrality**: Engines such as vLLM, TensorRT-LLM, and TGI can be swapped.
- **Infrastructure Neutrality**: Built on standard Kubernetes APIs, so it can run on public clouds, private clouds, or edge environments.
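One common way to keep the inference engine swappable is a small adapter registry: each engine contributes a builder that turns the same generic inputs into an engine-specific container description. The sketch below assumes this approach purely for illustration. The image names and the `MODEL_SOURCE` environment variable are placeholders, not real images or settings of vLLM or TGI, and nothing here is taken from Hearth's code.

```python
_ENGINES = {}  # engine name -> builder function


def register_engine(name: str):
    """Decorator that registers a builder under an engine name."""
    def decorator(fn):
        _ENGINES[name] = fn
        return fn
    return decorator


@register_engine("vllm")
def _build_vllm(model_source: str, gpu_count: int) -> dict:
    return {
        "image": "registry.example/vllm:placeholder",  # placeholder image
        "env": {"MODEL_SOURCE": model_source},         # hypothetical variable
        "gpus": gpu_count,
    }


@register_engine("tgi")
def _build_tgi(model_source: str, gpu_count: int) -> dict:
    return {
        "image": "registry.example/tgi:placeholder",   # placeholder image
        "env": {"MODEL_SOURCE": model_source},         # hypothetical variable
        "gpus": gpu_count,
    }


def build_container(engine: str, model_source: str, gpu_count: int) -> dict:
    """Look up the engine's builder and produce its container description."""
    try:
        builder = _ENGINES[engine]
    except KeyError:
        known = ", ".join(sorted(_ENGINES))
        raise ValueError(f"unsupported engine {engine!r}; known engines: {known}") from None
    return builder(model_source, gpu_count)


if __name__ == "__main__":
    print(build_container("vllm", "org/model", 1))
    print(build_container("tgi", "org/model", 1))
```

With this shape, switching engines is a one-word change in the declared spec, and adding an engine means registering one more builder rather than touching the controller's core logic.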

## Technical Challenges of Scale-to-Zero

Implementing Scale-to-Zero requires solving three problems:
- **Cold Start Latency**: Mitigated through model caching, layered loading, preloading daemons, and request queuing or batching.
- **Request Routing**: A proxy layer, such as the one provided by Knative Serving, receives requests while no replicas are running and triggers scaling.
- **State Management**: State persistence strategies are needed so that context such as conversation history and KV cache can be recovered after scaling.
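The request-routing and queuing ideas above can be sketched as a tiny in-memory proxy that buffers requests while the backend is scaled to zero, asks for a scale-up exactly once, and flushes the buffer when the backend reports ready. This is a single-threaded toy with hypothetical callback names (`scale_up`, `forward`); it is similar in spirit to the activator component in Knative Serving, but it is not that component's implementation, nor Hearth's.

```python
from collections import deque


class Activator:
    """Buffers requests during a cold start and replays them in arrival order."""

    def __init__(self, scale_up, forward):
        self._scale_up = scale_up    # callback: request that a replica be started
        self._forward = forward      # callback: send one request to the backend
        self._pending = deque()
        self._ready = False
        self._scaling = False

    def handle(self, request) -> None:
        if self._ready:
            self._forward(request)
            return
        self._pending.append(request)
        if not self._scaling:
            # Only the first buffered request triggers a scale-up.
            self._scaling = True
            self._scale_up()

    def on_backend_ready(self) -> None:
        self._ready = True
        self._scaling = False
        while self._pending:
            self._forward(self._pending.popleft())

    def on_scaled_to_zero(self) -> None:
        self._ready = False


if __name__ == "__main__":
    served = []
    proxy = Activator(scale_up=lambda: print("scale-up requested"),
                      forward=served.append)
    proxy.handle("req-1")
    proxy.handle("req-2")      # buffered; no second scale-up is requested
    proxy.on_backend_ready()   # both requests are forwarded, in order
    print(served)
```

A production version would also need timeouts and a bound on the queue, so that requests do not wait indefinitely if the model fails to load.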

## Applicable Scenarios and Limitations

**Applicable Scenarios**: Development and testing environments (reducing resource costs), low-frequency batch processing tasks (task-triggered scaling), multi-tenant services (on-demand resource allocation).

**Limitations**: High-concurrency, low-latency production services still require persistent instances. Hearth supports multiple deployment modes for users to choose from.
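Whether Scale-to-Zero pays off for a given workload comes down to simple arithmetic. The sketch below compares GPU-hours for an always-on replica with a scale-to-zero replica, under a deliberately simple model: each cold start costs its load time plus one idle-timeout window of unused GPU time. The model and all input values are illustrative assumptions, not measurements of Hearth.

```python
def gpu_hours(
    busy_hours_per_day: float,
    cold_starts_per_day: int,
    cold_start_seconds: float,
    idle_timeout_seconds: float,
    days: int = 30,
):
    """Return (always_on, scale_to_zero) GPU-hours for one replica over `days`."""
    always_on = 24.0 * days
    # Each burst of traffic pays for model loading plus the idle window
    # the replica waits through before scaling back down.
    overhead_hours = cold_starts_per_day * (cold_start_seconds + idle_timeout_seconds) / 3600.0
    per_day = min(24.0, busy_hours_per_day + overhead_hours)
    return always_on, per_day * days


if __name__ == "__main__":
    on, stz = gpu_hours(busy_hours_per_day=2.0, cold_starts_per_day=4,
                        cold_start_seconds=120.0, idle_timeout_seconds=300.0)
    print(f"always-on: {on:.1f} GPU-hours, scale-to-zero: {stz:.1f} GPU-hours")
```

With these example inputs the scale-to-zero replica uses roughly 74 GPU-hours per month against 720 for the always-on replica. As the busy hours approach 24 per day the gap closes, which matches the limitation stated above for high-concurrency production services.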

## Open-Source Significance and Community Value

Open-sourcing Hearth offers value in three ways:
1. Provides production-grade reference implementations, offering a starting point and benchmark for teams to evaluate technical solutions.
2. The open-source model aggregates community best practices to form comprehensive solutions.
3. Represents the cloud-native AI direction, treating AI workloads as first-class citizens to enhance automation, observability, and portability.