Zing Forum

Token-Aware-Balancer: An Intelligent LLM Load Balancer Based on Token Counting

This article introduces an innovative open-source project called Token-Aware-Balancer, developed in Go. It is an L7 reverse proxy that routes requests based on token count rather than connection count, optimized specifically for large language model (LLM) inference services. It can reduce P99 latency by 12% in high-concurrency scenarios.

Tags: Large Language Models · Load Balancing · Reverse Proxy · Go · Token Counting · Inference Optimization · High Concurrency · Latency Optimization · LLM Deployment · Open-Source Tools
Published 2026-04-06 20:43 · Recent activity 2026-04-06 20:57 · Estimated read: 8 min

Section 01

Introduction: Token-Aware-Balancer—An Intelligent LLM Load Balancer Based on Token Counting

This article introduces the open-source project Token-Aware-Balancer, an L7 reverse proxy developed in Go and optimized for LLM inference services. Its core innovation lies in using token count (instead of connection count/request count) as the basis for load balancing, which can more accurately reflect the actual load of backend servers and reduce P99 latency by 12% in high-concurrency scenarios. The project addresses the adaptation issue of traditional load balancers to heterogeneous LLM requests, providing an intelligent solution for efficient deployment of LLM inference services.


Section 02

Project Background: Limitations of Traditional Load Balancers in LLM Inference Scenarios

With the widespread adoption of LLMs, efficiently deploying and scaling inference services has become a key challenge. Traditional load-balancing strategies (least connections, fewest requests, round-robin) have an obvious shortcoming: the number of tokens in LLM requests varies enormously (from a handful to thousands), so these coarse-grained strategies cannot accurately assess load, leaving some servers overloaded while others sit idle and degrading service quality. Token-Aware-Balancer is designed to address this, with routing decisions based on the "in-flight token count" at its core.


Section 03

Core Methods: Token-Aware Load Balancing Strategy and Technical Architecture

Design Philosophy

  • Defects of Traditional Strategies: Connection count/request count ignore token differences; round-robin does not consider actual load.
  • Advantages of Token Awareness: Tokens are the basic unit of LLM computation; the number of in-flight tokens better reflects server busyness and supports predictive routing.
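The in-flight token idea above can be sketched as a per-backend counter that is incremented when a request is dispatched and decremented when it completes. The type and method names here are illustrative, not Token-Aware-Balancer's actual API:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Backend tracks the tokens currently being processed by one server.
// Names are hypothetical, not the project's real types.
type Backend struct {
	Name           string
	inFlightTokens int64
}

// Acquire records the estimated tokens when a request is routed here.
func (b *Backend) Acquire(tokens int64) { atomic.AddInt64(&b.inFlightTokens, tokens) }

// Release decrements the counter once the request completes.
func (b *Backend) Release(tokens int64) { atomic.AddInt64(&b.inFlightTokens, -tokens) }

// Load returns the current in-flight token count.
func (b *Backend) Load() int64 { return atomic.LoadInt64(&b.inFlightTokens) }

func main() {
	b := &Backend{Name: "gpu-0"}
	b.Acquire(1200) // prompt tokens + estimated output tokens
	b.Acquire(300)
	b.Release(1200)               // first request finished
	fmt.Println(b.Load()) // prints 300
}
```

Atomic operations keep the counter safe to update from the many request goroutines a Go proxy typically runs concurrently.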

Technical Implementation

  • L7 Reverse Proxy: Parse HTTP request content and extract LLM-related information.
  • Token Counting Mechanism: Parse request text → tokenize and calculate → estimate output tokens → update in-flight count → decrement after request completion.
  • Intelligent Routing Algorithm: Least tokens first, estimated completion time sorting, weighted distribution, dynamic threshold adjustment.
  • Health Check: Active detection + passive monitoring + graceful failover + automatic recovery.
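The "least tokens first" rule from the list above reduces to a minimum search over per-backend in-flight counts. A minimal sketch, assuming each backend exposes that count (identifiers are hypothetical):

```go
package main

import "fmt"

// backendLoad pairs a backend name with its current in-flight token count.
type backendLoad struct {
	name     string
	inFlight int64
}

// leastTokens picks the backend with the fewest in-flight tokens —
// a hypothetical rendering of the "least tokens first" strategy.
func leastTokens(backends []backendLoad) string {
	best := backends[0]
	for _, b := range backends[1:] {
		if b.inFlight < best.inFlight {
			best = b
		}
	}
	return best.name
}

func main() {
	pool := []backendLoad{
		{"gpu-0", 4800}, // busy with a long summarization request
		{"gpu-1", 900},  // mostly short Q&A traffic
		{"gpu-2", 2100},
	}
	fmt.Println(leastTokens(pool)) // prints gpu-1
}
```

Unlike least-connections, this choice distinguishes one 4,000-token summarization from four 50-token Q&A requests on the same connection count.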

Section 04

Performance Evidence: 12% Reduction in P99 Latency Under High Concurrency and Resource Utilization Optimization

  • P99 Latency Improvement: In stress tests, P99 latency drops by 12% compared with a traditional least-connections strategy, which improves user experience (especially for interactive applications), raises throughput, and trims long-tail latency.
  • Resource Utilization Optimization: Avoid server overload/idle, balance GPU utilization, and reduce queuing latency.

Section 05

Applicable Scenarios: Multi-Tenant, Mixed Load, and Other LLM Deployment Scenarios

Token-Aware-Balancer is particularly suitable for:

  1. Multi-Tenant Services: Balance heterogeneous requests from different tenants to ensure service quality.
  2. Mixed Load Environments: Handle various request types such as short Q&A and long document summaries.
  3. Heterogeneous Hardware Clusters: Dynamically distribute load based on server capabilities.
  4. High-Concurrency Inference Services: Improve latency distribution and provide stable services.

Section 06

Deployment and Usage: An Easy-to-Integrate Go Service

  • Basic Configuration: Specify backend servers, routing strategy, token-counting parameters, health-check thresholds, etc., via configuration files or command-line flags.
  • Integration Methods: Frontend proxy, Kubernetes integration (Service/Ingress), service mesh (Istio/Linkerd), cloud-native deployment (Docker/K8s).
  • Monitoring: Provide metrics such as in-flight token count, latency statistics, error rate, etc., and support Prometheus/Grafana visualization.

Section 07

Limitations and Future: Current Restrictions and Development Plans

Current Limitations

  • Tokenizer Dependency: Must use the same tokenizer as the backend LLM; model differences affect accuracy.
  • Estimation Uncertainty: There are errors in output token count estimation.
  • Single Point of Failure: Centralized proxy requires high-availability deployment.
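One common way to soften the estimation uncertainty noted above is to smooth observed output lengths with an exponentially weighted moving average. This is a generic sketch of that idea, not the project's code:

```go
package main

import "fmt"

// ewmaEstimator smooths observed output-token counts to predict the
// output length of the next request.
type ewmaEstimator struct {
	alpha float64 // weight of the newest observation, 0 < alpha <= 1
	value float64
}

// Observe folds a completed request's actual output length into the estimate.
func (e *ewmaEstimator) Observe(outputTokens float64) {
	if e.value == 0 {
		e.value = outputTokens // seed with the first observation
		return
	}
	e.value = e.alpha*outputTokens + (1-e.alpha)*e.value
}

// Estimate returns the current predicted output-token count.
func (e *ewmaEstimator) Estimate() float64 { return e.value }

func main() {
	est := &ewmaEstimator{alpha: 0.2}
	for _, observed := range []float64{400, 600, 500} {
		est.Observe(observed)
	}
	fmt.Printf("%.0f\n", est.Estimate()) // prints 452
}
```

A per-route or per-tenant estimator would track the distinct traffic shapes that the multi-tenant scenarios above describe.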

Future Directions

  • Multi-Model Support: Expand to more LLM architectures and tokenizers.
  • Adaptive Estimation: Use machine learning to improve output token estimation accuracy.
  • Distributed Architecture: Eliminate single-point bottlenecks.
  • Deep Integration with Inference Engines: Obtain more accurate internal states.
  • Cost-Aware Routing: Optimize costs by combining cloud billing.

Section 08

Conclusion: Innovative Value of LLM Inference Infrastructure

Token-Aware-Balancer is an important innovation in LLM inference service infrastructure. It achieves precise load balancing through token counting, and the 12% reduction in P99 latency significantly improves user experience. This project provides an intelligent solution for efficient LLM deployment, has reference value for teams building/optimizing LLM inference services, and also promotes the evolution of LLM service architectures.