Zing Forum

Reading

Barrel Inference: Innovative Practice of Natively Integrating LLM Inference into the Erlang/OTP Ecosystem

Barrel Inference is an open-source project that natively integrates large language model (LLM) inference capabilities into the Erlang/OTP ecosystem. By calling llama.cpp via dirty NIFs, it implements supervised model processes, token-precise hierarchical KV caching, and an HTTP service daemon compatible with OpenAI, Anthropic, and Ollama.

ErlangOTPLLM推理llama.cppBEAMNIFOpenAI API开源
Published 2026-05-26 07:42Recent activity 2026-05-26 07:49Estimated read 6 min
Barrel Inference: Innovative Practice of Natively Integrating LLM Inference into the Erlang/OTP Ecosystem
1

Section 01

Introduction / Main Post: Barrel Inference: Innovative Practice of Natively Integrating LLM Inference into the Erlang/OTP Ecosystem

Barrel Inference is an open-source project that natively integrates large language model (LLM) inference capabilities into the Erlang/OTP ecosystem. By calling llama.cpp via dirty NIFs, it implements supervised model processes, token-precise hierarchical KV caching, and an HTTP service daemon compatible with OpenAI, Anthropic, and Ollama.

3

Section 03

Background: Why Do We Need a Native OTP Inference Runtime?

In current LLM deployment practices, most inference services are built on the Python ecosystem, such as vLLM and TGI. While these solutions are powerful, they mean that developers building distributed systems with Erlang/OTP need to maintain an additional Python sidecar service, increasing system complexity and operational costs.

Barrel Inference was created to address this pain point. Its core idea is to treat LLM inference capabilities as first-class citizens of OTP (Open Telecom Platform) rather than external dependencies. This means inference services can be supervised, managed, and scheduled like other Erlang processes, fully leveraging OTP's powerful fault tolerance and distributed features.


4

Section 04

Project Architecture: Three-Tier Modular Design

Barrel Inference uses a rebar3 umbrella project structure, dividing its functionality into three independent application modules, each of which can be released as a separate Hex package:

5

Section 05

1. barrel_inference — Core Runtime

This is the core of the entire project, responsible for actual model inference. It calls the underlying llama.cpp library via dirty NIFs (Non-Integrable Functions) and implements the following key features:

  • Supervised model processes: Each loaded model runs in an independent supervised process, following OTP's supervision tree design principles
  • Token-precise hierarchical KV caching: Implements fine-grained KV cache management, supports hierarchical storage strategies, and optimizes memory usage efficiency
  • Automatic cleanup on connection cancellation: Automatically cancels related inference tasks and cleans up resources when the client disconnects
6

Section 06

2. barrel_inference_server — API Daemon

This layer provides HTTP API interfaces compatible with mainstream AI services, including:

  • OpenAI-compatible API: Supports standard endpoints like /v1/chat/completions
  • Anthropic-compatible API: Supports Claude-style interface calls
  • Ollama-compatible API: Supports Ollama's local model management protocol

In addition, it provides features such as a model registry, model-specific request queuing, keep-alive connections, and metrics monitoring.

7

Section 07

3. barrel_inference_cli — Command-Line Tool

To facilitate operation, maintenance, and development, the project provides a complete CLI tool barrel-inference that supports the following commands:

  • serve: Starts the API service daemon
  • pull <model>: Pulls model files from remote sources
  • run <model> "<prompt>": Executes a one-time inference task
  • ps: Lists all currently loaded models
  • rm: Removes loaded models

8

Section 08

Supervision and Fault Tolerance

Thanks to OTP's supervision tree mechanism, Barrel Inference can automatically handle model process crashes. When a model process exits abnormally, the supervisor restarts it according to a preset strategy, ensuring continuous service availability. This design is particularly important in production environments, as it can significantly reduce the need for manual intervention.