Glassbox: A Learning Journey to Build a Local LLM Inference Engine from Scratch

An open-source project for learning ML infrastructure, which builds a local large language model (LLM) inference engine with OpenAI-compatible APIs by gradually replacing black-box abstractions.

Tags: LLM, Inference Engine, TinyLlama, FastAPI, OpenAI API, GPU Inference, Transformer, PyTorch, Machine Learning, Education
Published 2026-06-07 04:14 | Recent activity 2026-06-07 04:22 | Estimated read 6 min

Section 01

Glassbox: An Educational Open-Source Project for Local LLM Inference

Glassbox is an educational open-source project by Baighasan (hosted on GitHub as glassbox-inference, released on 2026-06-06). Its core goal is to build a local LLM inference engine that runs TinyLlama on a GPU, exposes an OpenAI-compatible API, implements custom greedy decoding, and reports key metrics (latency, tokens per second, memory usage). The project prioritizes learning value: it progressively replaces black-box abstractions with explicit implementations to demystify ML inference infrastructure.


Section 02

Vision & Core Philosophy

Glassbox's name reflects its core philosophy: turning ML inference from a "black box" into a "glass box" for learning. Unlike performance-optimized engines, it prioritizes understanding: it starts with high-level abstractions (e.g., Hugging Face's model.generate()) and gradually replaces them with lower-level code (e.g., explicit model.forward() calls). The ultimate aim is to let users understand every layer of the inference stack.


Section 03

Architecture & API Design

The project uses a layered architecture:

  1. Inference Server: FastAPI-based entry point with OpenAI-compatible endpoints (health check, models list, completions, chat completions).
  2. Core Components: OpenAI request validation, prompt formatter (adapts to model templates), Glassbox inference engine (coordinates tokenization, model execution, decoding), tokenizer wrapper, model runner (loads and runs models).
  3. API Constraints: Rejects streaming, non-zero temperature, multiple candidates (n>1), and tool calls to keep the MVP focused.
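The API constraints listed above can be sketched as a small validation step that runs before any inference work. This is a minimal illustration, not the project's actual code: `validate_request` and `UnsupportedRequestError` are hypothetical names, the request is modeled as a plain dict rather than a FastAPI/Pydantic model, and treating a missing `temperature` as 0 is an assumption of this sketch.

```python
from typing import Any


class UnsupportedRequestError(ValueError):
    """Raised when a request uses a feature the MVP does not support."""


def validate_request(body: dict[str, Any]) -> None:
    """Reject the four excluded features; return None if the request is acceptable."""
    if body.get("stream"):
        raise UnsupportedRequestError("streaming is not supported")

    # Greedy decoding is deterministic, so only temperature 0 is accepted.
    # Assumption: an absent or null temperature is treated as 0.
    if body.get("temperature") not in (None, 0):
        raise UnsupportedRequestError("temperature must be 0")

    # Only a single candidate per request.
    if (body.get("n") or 1) > 1:
        raise UnsupportedRequestError("n > 1 is not supported")

    if body.get("tools") or body.get("tool_choice"):
        raise UnsupportedRequestError("tool calls are not supported")
```

In a server, an error like this would typically be mapped to an HTTP 400 response so that clients learn immediately which option is unsupported.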

Section 04

Milestones & Metrics Collection

The project defines eight milestones:

  1. Project skeleton (structure, config, tests).
  2. OpenAI API shell (mock responses).
  3. GPT-2 running on CPU (using model.generate()).
  4. Benchmark script (measures latency, tokens per second).
  5. Replace model.generate() with custom greedy decoding (explicit forward() calls).
  6. GPU support (CUDA, memory metrics).
  7. TinyLlama MVP (GPU run, chat template, full metrics).
  8. Final docs & summary.
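Milestone 5 is the heart of the "glass box" idea, so a sketch of greedy decoding may help. The loop below is an illustration under stated assumptions, not the project's implementation: `forward` is a hypothetical callable that takes the full token sequence and returns next-token logits, and `toy_forward` is a stand-in for a real model. Because KV caching is a non-goal for the MVP, the whole sequence is passed to `forward` at every step.

```python
from typing import Callable, Sequence


def greedy_decode(
    forward: Callable[[Sequence[int]], Sequence[float]],
    prompt_ids: Sequence[int],
    max_new_tokens: int,
    eos_id: int,
) -> list[int]:
    """Return the generated token ids (prompt excluded)."""
    ids = list(prompt_ids)
    generated: list[int] = []
    for _ in range(max_new_tokens):
        logits = forward(ids)  # one score per vocabulary entry
        # Greedy choice: take the index of the highest-scoring token (argmax).
        next_id = max(range(len(logits)), key=logits.__getitem__)
        if next_id == eos_id:
            break
        ids.append(next_id)
        generated.append(next_id)
    return generated


def toy_forward(ids: Sequence[int]) -> list[float]:
    """Stand-in model over a 5-token vocabulary: prefers (last id + 1) % 5."""
    logits = [0.0] * 5
    logits[(ids[-1] + 1) % 5] = 1.0
    return logits


print(greedy_decode(toy_forward, [0], max_new_tokens=10, eos_id=4))  # [1, 2, 3]
```

With a real PyTorch model, the same structure would apply, with the argmax taken over the logits at the last sequence position.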

Key metrics collected: model load time, prompt/completion token counts, total latency, tokens per second, device type (CPU/GPU), data type, GPU memory usage.
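As a rough illustration of how the latency and throughput figures might be derived, the sketch below times a generation call and computes tokens per second. `RequestMetrics` and `timed_generate` are hypothetical names, and defining throughput as completion tokens divided by total latency is an assumption; the project may define it differently. On a GPU, a real measurement would also need to wait for queued kernels (for example with `torch.cuda.synchronize()`) before reading the clock, and peak memory could be read with `torch.cuda.max_memory_allocated()`.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class RequestMetrics:
    prompt_tokens: int
    completion_tokens: int
    total_latency_s: float

    @property
    def tokens_per_second(self) -> float:
        # Guard against a zero-length interval on very fast calls.
        if self.total_latency_s <= 0:
            return 0.0
        return self.completion_tokens / self.total_latency_s


def timed_generate(
    generate: Callable[[list[int]], list[int]],
    prompt_ids: list[int],
) -> tuple[list[int], RequestMetrics]:
    """Run `generate` once and record token counts and wall-clock latency."""
    start = time.perf_counter()
    completion_ids = generate(prompt_ids)
    elapsed = time.perf_counter() - start
    metrics = RequestMetrics(len(prompt_ids), len(completion_ids), elapsed)
    return completion_ids, metrics
```

For example, 20 completion tokens produced in 2.0 seconds gives 10 tokens per second under this definition.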


Section 05

Target Hardware & Scope Control

Target hardware: Ubuntu server, Intel Core i9-9880H, 32 GB RAM, NVIDIA Quadro T2000 (4 GB GPU memory). TinyLlama (1.1B parameters) was chosen because it fits within the 4 GB GPU memory constraint.
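A back-of-the-envelope calculation shows why a 1.1B-parameter model is a reasonable fit for this card. The sketch below estimates memory for the weights alone; it ignores activations, the CUDA context, and other overhead, and the source does not state which data type the project loads the model in, so both common options are shown.

```python
# Bytes needed to store one parameter in each data type.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}
GIB = 1024 ** 3


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Approximate memory for model weights only, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / GIB


for dtype in ("float32", "float16"):
    print(f"{dtype}: {weight_memory_gib(1.1e9, dtype):.2f} GiB")
# float32: 4.10 GiB
# float16: 2.05 GiB
```

Under these assumptions, 32-bit weights alone would exceed the 4 GB available, while 16-bit weights leave roughly half the memory for everything else.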

Non-goals for MVP: streaming responses, request batching/queuing, KV cache, model quantization, Docker containerization, distributed inference, C++/CUDA runtime code.


Section 06

Project Value & Future Directions

Glassbox's value lies in its learning methodology: progressive decomposition of abstractions, measurable milestones, and integration of engineering practices (API design, metrics, testing) with ML theory.

Future plans:

  • Performance: KV cache implementation, quantization, time-to-first-token (TTFT) metrics.
  • Features: Streaming responses, request queuing/batching.
  • Architecture: Separate API server from model workers, explore Go for control plane, C++/CUDA for runtime, distributed inference.