LLMBenchmark: A Comprehensive Evaluation Platform for Large Language Models in SMS Generation Scenarios

LLMBenchmark is a modular large language model evaluation platform built on .NET 10. It targets SMS generation and rewriting tasks and covers output quality assessment, token estimation accuracy, latency measurement, deterministic verification, and LLM-as-a-Judge evaluation.

Tags: LLM evaluation, large language models, benchmarking, .NET, SMS generation, token estimation, model comparison, LLM-as-a-Judge
Published 2026-06-16 19:46 | Recent activity 2026-06-16 19:49 | Estimated read 9 min
Section 01

LLMBenchmark: A Comprehensive Evaluation Platform for Large Language Models in SMS Generation Scenarios

The core goal is to help developers and enterprises objectively and systematically evaluate the actual performance of different LLMs in SMS scenarios, addressing the pain point that existing general-purpose evaluation tools struggle to provide fine-grained scenario-specific comparisons.

Section 02

Project Background and Positioning

With the widespread application of LLMs across various industries, how to objectively evaluate the actual performance of different models has become a core challenge. Existing evaluation tools are often too general and struggle to provide fine-grained performance comparisons for specific business scenarios (such as SMS generation/rewriting).

LLMBenchmark was created to address this pain point. It focuses on SMS generation and rewriting scenarios, and through a structured scenario-driven framework, helps users answer key questions: Which model generates the highest quality SMS? Which has the fastest response speed? Which offers the best cost-effectiveness? Which retains placeholders most reliably?

Section 03

Core Architecture and Technology Stack

The project is built on ASP.NET Core Minimal APIs in .NET 10, follows cloud-native design principles, and organizes a configurable, extensible pipeline around scenarios.

Technology Stack Highlights

  • .NET 10: Leveraging high-performance features of the latest version
  • ASP.NET Core Minimal API: Lightweight, high-performance API endpoints
  • PostgreSQL: Persistent storage for evaluation results and verification data
  • Entity Framework Core: Modern data access layer
  • Docker: Containerized deployment support
  • LlmTornado: LLM interaction abstraction layer
  • SharpToken: Token counting and estimation

Section 04

Evaluation Pipeline and Two-Layer Verification System

Evaluation Pipeline

Each task is broken down into key stages:

  1. Scenario Loading: Takes JSON-format scenario files as input, where each scenario represents a specific SMS operation task (e.g., generation, rewriting).
  2. Request Construction and Token Estimation: Uses heuristic rules or the SharpToken library to estimate token consumption, providing a baseline for cost analysis.
  3. Multi-provider Execution and Latency Measurement: Supports GitHub Models and reserves extension interfaces for OpenAI, Azure OpenAI, etc., with precise end-to-end latency measurement.
  4. Result Persistence: Stores raw responses, token usage, and latency data in PostgreSQL, forming a traceable evaluation history.
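
The four stages above can be sketched in a few lines. This is an illustrative Python sketch, not the project's .NET code: the scenario fields (`id`, `operation`, `prompt`), the four-characters-per-token heuristic, and the `fake_provider` stub are all assumptions made for the example, and the final record is returned as a dict instead of being written to PostgreSQL.

```python
import json
import time
from typing import Callable

# Hypothetical scenario file content; the real schema is not shown in the source.
SCENARIO_JSON = """
{"id": "promo-001", "operation": "Rewrite",
 "prompt": "Rewrite: Hi {name}, your order ships today."}
"""

def estimate_tokens(text: str) -> int:
    """Heuristic estimate: roughly 4 characters per token (an assumption)."""
    return max(1, round(len(text) / 4))

def fake_provider(prompt: str) -> dict:
    """Stand-in for a real provider call; returns text plus reported usage."""
    return {"text": "Hi {name}, your order is on its way today.", "prompt_tokens": 13}

def run_scenario(raw_json: str, provider: Callable[[str], dict]) -> dict:
    scenario = json.loads(raw_json)                      # 1. scenario loading
    estimated = estimate_tokens(scenario["prompt"])      # 2. token estimation
    start = time.perf_counter()                          # 3. end-to-end latency
    response = provider(scenario["prompt"])
    latency_ms = (time.perf_counter() - start) * 1000.0
    return {                                             # 4. record to persist
        "scenario_id": scenario["id"],
        "operation": scenario["operation"],
        "estimated_prompt_tokens": estimated,
        "actual_prompt_tokens": response["prompt_tokens"],
        "latency_ms": latency_ms,
        "raw_response": response["text"],
    }

record = run_scenario(SCENARIO_JSON, fake_provider)
print(record["scenario_id"], record["estimated_prompt_tokens"])
```

Passing the provider in as a callable mirrors the idea of an extension interface: a stub, GitHub Models, or another backend can be swapped without touching the timing and recording logic.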

Two-Layer Verification System

  • Deterministic Validator: Performs precise rule matching (e.g., placeholder retention, link format, character limits, etc.).
  • LLM-as-a-Judge Intelligent Evaluation: Assesses dimensions that cannot be quantified by hard rules, such as semantic retention, tone consistency, language quality, and instruction compliance.
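
The deterministic layer can be illustrated with a small rule checker. The sketch below is in Python and rests on assumptions not confirmed by the source: placeholders are taken to use `{name}` syntax, links are plain `http(s)` URLs, and the limit defaults to 160 characters (the length of a single GSM-7 SMS segment). The names `validate_sms` and `ValidationResult` are hypothetical.

```python
import re
from dataclasses import dataclass, field

PLACEHOLDER = re.compile(r"\{[A-Za-z_][A-Za-z0-9_]*\}")  # assumed {name} syntax
URL = re.compile(r"https?://\S+")

@dataclass
class ValidationResult:
    passed: bool
    failures: list[str] = field(default_factory=list)

def validate_sms(source: str, output: str, max_chars: int = 160) -> ValidationResult:
    failures: list[str] = []
    # Placeholder retention: every placeholder in the source must survive.
    missing = set(PLACEHOLDER.findall(source)) - set(PLACEHOLDER.findall(output))
    if missing:
        failures.append(f"missing placeholders: {sorted(missing)}")
    # Link format: every link in the source must appear unchanged.
    for url in URL.findall(source):
        url = url.rstrip(".,;:!?)")  # ignore trailing sentence punctuation
        if url not in output:
            failures.append(f"link altered or dropped: {url}")
    # Character limit: the default is an assumption, pass the real limit in.
    if len(output) > max_chars:
        failures.append(f"too long: {len(output)} > {max_chars}")
    return ValidationResult(passed=not failures, failures=failures)

ok = validate_sms("Hi {name}, track at https://ex.am/t now",
                  "Hello {name}! Track your parcel: https://ex.am/t")
bad = validate_sms("Hi {name}, code {code}", "Hi there, code {code}")
print(ok.passed, bad.failures)
```

Checks like these are cheap and repeatable, which is why they sit in front of the judge model: the LLM-as-a-Judge layer only needs to score what rules cannot express, such as tone or semantic retention.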

Section 05

Supported SMS Operation Types and Token Estimation Accuracy

SMS Operation Types

The platform defines seven core operations:

Operation Type   Function Description
Generate         Generate a new SMS based on prompts
Rewrite          Rewrite existing SMS content
Shorten          Compress SMS length to meet character limits
Expand           Expand SMS content to add details
Formalize        Convert to formal tone
Casualize        Convert to casual tone
FixGrammar       Correct grammar errors

Token Estimation Accuracy

The platform compares the estimator's predicted token counts with the actual token usage reported by providers, helping users understand the error range of different tokenizers. This matters for cost budgeting and capacity planning, because token usage directly determines API call costs.
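
One way to summarize that comparison is with percentage errors over a batch of requests. The Python sketch below is an assumption about how such a report could be computed, not the platform's actual metric set; the function name `estimation_errors` and the sample pairs are hypothetical.

```python
from statistics import mean

def estimation_errors(pairs: list[tuple[int, int]]) -> dict:
    """pairs holds (estimated_tokens, actual_tokens); actual must be > 0."""
    signed = [(est - act) / act for est, act in pairs]
    return {
        # Bias: positive means the estimator overestimates on average.
        "mean_signed_pct": 100 * mean(signed),
        # Mean absolute percentage error (MAPE).
        "mean_abs_pct": 100 * mean(abs(e) for e in signed),
        # Worst single-request error.
        "max_abs_pct": 100 * max(abs(e) for e in signed),
    }

# Illustrative numbers only.
samples = [(110, 100), (45, 50), (200, 200)]
report = estimation_errors(samples)
print(report)
```

Reporting the signed and absolute errors separately is useful here: an estimator can have near-zero bias while still being off by a wide margin on individual requests, and budgeting cares about both.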

Section 06

Practical Application Value and Future Evolution Directions

Practical Application Value

For developers of SMS service platforms, marketing automation systems, or customer service chatbots, LLMBenchmark provides a quantifiable, reproducible, and extensible model selection tool. It answers not only the qualitative question of which model is better but also provides quantitative insights (for example, Model A responds 23% faster and costs 15% less than Model B, but has an 8% lower placeholder retention rate).
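
The arithmetic behind such a statement is a relative difference against a baseline. The Python sketch below uses made-up aggregates chosen only to reproduce the example figures; it reads "23% faster" as 23% lower latency and reports the placeholder gap in percentage points, both of which are interpretations, not definitions from the source.

```python
def pct_change(candidate: float, baseline: float) -> float:
    """Percentage difference of candidate relative to baseline (baseline != 0)."""
    return 100.0 * (candidate - baseline) / baseline

# Hypothetical per-model aggregates, not measured results.
model_a = {"latency_ms": 770.0, "cost_usd": 0.85, "placeholder_rate": 0.90}
model_b = {"latency_ms": 1000.0, "cost_usd": 1.00, "placeholder_rate": 0.98}

latency_pct = pct_change(model_a["latency_ms"], model_b["latency_ms"])
cost_pct = pct_change(model_a["cost_usd"], model_b["cost_usd"])
# Rates are compared in percentage points to avoid ambiguity.
placeholder_points = 100.0 * (model_a["placeholder_rate"] - model_b["placeholder_rate"])

print(f"latency {latency_pct:+.1f}%, cost {cost_pct:+.1f}%, "
      f"placeholder retention {placeholder_points:+.1f} points")
```

Stating whether a gap is relative or in percentage points matters for rates: a drop from 98% to 90% is 8 points but about 8.2% in relative terms.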

Future Directions

Planned additions:

  • Multi-provider parallel execution
  • Visual dashboard
  • Cost report generation
  • Retry strategy and fault tolerance
  • Streaming response support
  • Prompt version management
  • Scenario tag system
  • Historical trend visualization