Zing Forum


inference-research: Automated LLM Inference Engine Nightly Tracking and Benchmarking System

Inspired by Andrej Karpathy's autoresearch, it automatically crawls updates from mainstream inference engines like vLLM and SGLang every night, uses Claude Opus for intelligent filtering, and generates executable benchmark plans for DGX Spark clusters.

Tags: LLM Inference · vLLM · SGLang · TensorRT-LLM · Automated Research · Benchmarking · DGX Spark · Claude Opus
Published 2026-04-14 21:45 · Recent activity 2026-04-14 21:51 · Estimated read 7 min

Section 01

inference-research: Automated LLM Inference Engine Nightly Tracking and Benchmarking System Guide

inference-research is an automated tool inspired by Andrej Karpathy's autoresearch, focused on nightly tracking and benchmarking of LLM inference engines. It addresses the challenges inference system engineers face in tracking technical progress, evaluating the impact of new features, and converting both into executable experimental plans. Its core loop: every night it crawls updates from five mainstream inference engines, including vLLM and SGLang, filters them intelligently with Claude Opus, and generates executable benchmark plans for DGX Spark clusters.


Section 02

Project Background and Design Philosophy

Background

Andrej Karpathy's autoresearch demonstrated methods for automated tracking of machine learning frontiers. inference-research draws on this concept but focuses on inference system optimization. Since engines like vLLM and SGLang evolve daily, manual tracking easily misses key updates, requiring an automated solution.

Design Principles

  • Comprehensive Coverage: Monitor 5 major mainstream inference engines
  • Intelligent Filtering: Claude Opus ranks updates by influence and provides explanations
  • Action-Oriented: Convert insights into executable benchmark plans for real hardware

Section 03

Monitored Engines and Hardware Infrastructure

Five Major Inference Engines

| Project | Repository | Core Technical Focus |
| --- | --- | --- |
| vLLM | vllm-project/vllm | PagedAttention, chunked prefill, speculative decoding |
| SGLang | sgl-project/sglang | RadixAttention, prefix caching, constrained decoding |
| TensorRT-LLM | NVIDIA/TensorRT-LLM | Quantization, dynamic batching, Blackwell kernels |
| llm-d | llm-d/llm-d | Kubernetes-native serving, prefill/decode disaggregation |
| Dynamo | ai-dynamo/dynamo | KV routing, NIXL, disaggregated inference OS |

Hardware Cluster

| Node | IP Address | Configuration |
| --- | --- | --- |
| spark-01 | 192.168.1.76 | DGX Spark, 128 GB unified memory (NVLink-C2C) |
| spark-02 | 192.168.1.77 | DGX Spark, 128 GB unified memory (NVLink-C2C) |
| controller | 192.168.1.75 | CPU-only orchestration node |

Section 04

Automated Workflow

The pipeline runs nightly at 2:00 AM and proceeds through four stages:

Data Collection

  • GitHub API: crawl PRs and releases from the 5 monitored repositories
  • arXiv: retrieve the day's inference-related papers

Raw data is saved as JSON to support auditing.
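As a rough illustration of the collection step, the sketch below pulls recent releases for the monitored repositories through the public GitHub REST API and persists the raw responses as JSON. The repository list comes from the engines table above; the function names and output layout are hypothetical, not taken from the project.

```python
import json
import pathlib
import urllib.request

# The five repositories tracked nightly (from the engines table).
REPOS = [
    "vllm-project/vllm",
    "sgl-project/sglang",
    "NVIDIA/TensorRT-LLM",
    "llm-d/llm-d",
    "ai-dynamo/dynamo",
]

def release_url(repo: str, per_page: int = 10) -> str:
    """Build the GitHub REST API URL for a repo's recent releases."""
    return f"https://api.github.com/repos/{repo}/releases?per_page={per_page}"

def fetch_json(url: str) -> list:
    """Fetch and decode a JSON payload from the GitHub API."""
    req = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def save_raw(payload: list, repo: str, out_dir: str = "data/raw") -> pathlib.Path:
    """Persist the raw API response as JSON so every run stays auditable."""
    path = pathlib.Path(out_dir) / f"{repo.replace('/', '__')}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2))
    return path
```

A nightly driver would simply loop `save_raw(fetch_json(release_url(repo)), repo)` over `REPOS`; keeping the unmodified responses on disk is what makes later curation decisions auditable.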

Intelligent Curation

Claude Opus then analyzes the collected items along three axes:

  • Influence ranking: grade each update by technical importance
  • Meaning interpretation: explain why the change matters
  • Impact rating: 🔴 (high), 🟡 (medium), 🟢 (low)
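The curation output can be post-processed into a ranked, emoji-tagged list. The sketch below assumes the model returns a 0-10 importance score plus a one-line rationale per update; the score thresholds and all names here are illustrative, not the project's actual schema.

```python
from dataclasses import dataclass

# Emoji tiers used in the nightly report (thresholds are illustrative).
TIERS = [(8, "🔴"), (5, "🟡"), (0, "🟢")]

@dataclass
class Update:
    title: str
    score: int      # 0-10 importance score assigned by the LLM
    rationale: str  # one-line explanation returned alongside the score

def impact_emoji(score: int) -> str:
    """Map a 0-10 importance score onto the report's three-tier rating."""
    for threshold, emoji in TIERS:
        if score >= threshold:
            return emoji
    return "🟢"

def curate(updates: list) -> list:
    """Sort updates by descending importance and attach the tier emoji."""
    ranked = sorted(updates, key=lambda u: u.score, reverse=True)
    return [(impact_emoji(u.score), u.title, u.rationale) for u in ranked]
```

Keeping the ranking logic outside the LLM call has one practical benefit: the tier boundaries can be tuned without re-querying the model.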

Benchmark Plan Generation

Generate a sequence of executable bash commands for DGX Spark clusters.
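A plan generator of this kind might look like the sketch below, which emits a bash script targeting the two DGX Spark nodes from the hardware table. The command names, flags, and `bench_client.py` helper are hypothetical placeholders, not the project's real commands.

```python
# Node IPs from the hardware cluster table.
NODES = {"spark-01": "192.168.1.76", "spark-02": "192.168.1.77"}

def plan_for(engine: str, model: str, port: int = 8000) -> str:
    """Render an executable bash benchmark plan for both Spark nodes."""
    lines = ["#!/usr/bin/env bash", "set -euo pipefail", ""]
    for node, ip in NODES.items():
        lines.append(f"# Launch {engine} serving {model} on {node}")
        lines.append(
            f"ssh {ip} 'docker run --gpus all -p {port}:{port} "
            f"{engine}-bench --model {model} --port {port}'"
        )
    lines.append("# Drive load against both nodes, collect latency/throughput")
    targets = ",".join(NODES.values())
    lines.append(f"python bench_client.py --targets {targets} --port {port}")
    return "\n".join(lines) + "\n"
```

Emitting a plain bash script, rather than executing commands directly, matches the project's human-in-the-loop design: an engineer can review and edit the plan before running it on the cluster.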

Versioned Commit

All outputs (reports, data, plans, logs) are committed to Git, forming a traceable history.
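The commit step can be reduced to a small, inspectable command sequence. The sketch below builds (rather than executes) the git commands that would stage and commit a night's outputs; the directory names and commit-message format are assumptions, not the project's actual layout.

```python
import datetime
import shlex

# Output directories committed after each nightly run (illustrative names).
OUTPUT_DIRS = ["reports", "data", "plans", "logs"]

def commit_commands(date: datetime.date) -> list:
    """Build the git commands that version one night's outputs."""
    msg = f"nightly: {date.isoformat()} report, raw data, and benchmark plans"
    return [
        f"git add {' '.join(OUTPUT_DIRS)}",
        f"git commit -m {shlex.quote(msg)}",
    ]
```

Because every run produces exactly one commit, `git log` doubles as the system's run history, and any report can be traced back to the raw JSON it was generated from.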


Section 05

Technical Highlights and Application Scenarios

Technical Highlights

  1. Intelligent Automation: machine collection, AI interpretation, and human decision-making form an efficient division of labor
  2. Hardware-Software Integration: Deep integration with DGX clusters, converting insights into actual measurement plans
  3. Ecosystem Panorama: Covers 5 engines with different technical routes
  4. Scalable Architecture: Easy to add repositories, adjust strategies, or replace LLMs

Application Scenarios

  • Inference R&D Teams: Track competitor dynamics
  • AI Infra Engineers: Discover performance optimization opportunities
  • Technical Decision-Makers: Grasp trends to support selection
  • Academic Researchers: Understand industrial progress
  • Hardware Vendors: Optimize hardware to match software requirements

Section 06

Limitations and Improvement Directions

Limitations

  • Limited Data Sources: does not cover Hugging Face or Papers with Code
  • Lack of Community Voices: Does not track issues and discussions
  • Benchmark Execution Requires Manual Effort: Not fully automated
  • Single Hardware Support: Only DGX Spark

Improvement Directions

  • Expand data sources to Hugging Face and others
  • Add community discussion tracking
  • Implement automatic benchmark execution
  • Support more hardware configurations