TingIS: Enterprise-Grade Real-Time Risk Event Discovery System, Using Large Models to Extract Key Signals from Massive Noise


Tags: AIOps (Intelligent O&M) · Large Language Models · Event Discovery · Real-Time Systems · Noise Filtering · Cloud-Native · Fault Detection
Published 2026-04-24 01:40 · Recent activity 2026-04-24 13:19 · Estimated read: 4 min

Section 01

Introduction: TingIS, an Enterprise-Grade Real-Time Risk Event Discovery System

The Alibaba Cloud team has open-sourced the TingIS system. By combining a multi-stage event linking engine with large language models, it extracts actionable risk events from over 2,000 user feedback entries per minute, achieving a 95% high-priority event discovery rate and a P90 latency of 3.5 minutes, helping enterprises tackle operations and maintenance (O&M) challenges in the cloud-native era.


Section 02

Background: Cloud-Native O&M Dilemmas and the Value of User Feedback

In the cloud-native era, system complexity grows exponentially and traditional monitoring systems have blind spots. User feedback carries semantic information that system monitoring cannot capture, but converting it into risk signals faces several challenges: a high noise ratio, complex semantics, strict real-time requirements, and difficult event aggregation.


Section 03

TingIS System Architecture: Three-Core Design Layers and Key Mechanisms

1. Multi-stage event linking engine: efficient index recall of candidates → LLM association judgment → incremental event maintenance.
2. Cascaded business routing mechanism: coarse-grained classification → fine-grained attribution → dynamic load balancing.
3. Multi-dimensional noise reduction pipeline: domain knowledge filtering → statistical pattern recognition → behavioral feature filtering → LLM semantic verification.
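The three stages of the event linking engine can be sketched as follows. This is a minimal illustration, not the open-sourced implementation: the `Event` class, the token-overlap recall, and the keyword-based stand-in for the LLM judgment are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    title: str
    feedback: list[str] = field(default_factory=list)

def index_recall(entry: str, events: list[Event], k: int = 5) -> list[Event]:
    # Stage 1: cheap token-overlap recall, standing in for a real inverted index.
    tokens = set(entry.lower().split())
    scored = [(len(tokens & set(e.title.lower().split())), e) for e in events]
    hits = sorted((p for p in scored if p[0] > 0), key=lambda p: p[0], reverse=True)
    return [e for _, e in hits[:k]]

def llm_linked(entry: str, candidate: Event) -> bool:
    # Stage 2: LLM association judgment, stubbed here with a keyword heuristic.
    return "timeout" in entry.lower() and "timeout" in candidate.title.lower()

def link_feedback(entry: str, events: list[Event]) -> Event:
    # Stage 3: incremental maintenance - attach to a matched event, else open a new one.
    for cand in index_recall(entry, events):
        if llm_linked(entry, cand):
            cand.feedback.append(entry)
            return cand
    fresh = Event(title=entry, feedback=[entry])
    events.append(fresh)
    return fresh

events = [Event(title="API gateway timeout errors")]
link_feedback("timeout on every API call since 10:00", events)  # merges into the existing event
link_feedback("Cannot log in to the web console", events)       # no match, opens a new event
```

The point of the staged design is that the index narrows thousands of events to a handful of candidates, so the expensive association check only runs on a short list.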

Section 04

Production Environment Performance: Data Validation of System Efficacy

Peak processing exceeds 2,000 entries per minute, averaging 300,000 entries per day; P90 latency is 3.5 minutes; the high-priority event discovery rate is 95%. Comparative tests show routing accuracy, clustering quality, and signal-to-noise ratio all exceed baseline methods.
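A P90 latency of 3.5 minutes means 90% of events are surfaced within 3.5 minutes of the triggering feedback. A minimal nearest-rank computation makes the metric concrete; the sample latencies below are made up (chosen so the result matches the reported figure), not real TingIS data:

```python
import math

def p90(latencies_min: list[float]) -> float:
    # Nearest-rank percentile: smallest value with >= 90% of samples at or below it.
    ordered = sorted(latencies_min)
    rank = math.ceil(0.9 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

# Hypothetical per-event detection latencies in minutes.
samples = [1.2, 1.8, 2.0, 2.4, 2.9, 3.1, 3.3, 3.4, 3.5, 9.7]
p90(samples)  # -> 3.5: a single slow outlier does not drag the P90 up
```

This is also why P90 is a better headline number for a noisy stream than the mean, which the 9.7-minute outlier would skew.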


Section 05

Technical Highlights and Industry Insights

Technical highlights: deep integration of engineering and algorithms; pragmatic use of LLMs (LLMs in key links, traditional methods elsewhere); interpretability and controllability. Industry insights: user feedback is an important data dimension for O&M; deep LLM application in vertical scenarios delivers real value; a layered architecture balances real-time performance and quality.
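"LLMs in key links, traditional methods elsewhere" can be read as a cost cascade: cheap rule-based and statistical filters run first, and only surviving entries reach the expensive LLM stage. A toy sketch under that reading; the noise patterns, the repeat cap, and the keyword stand-in for LLM verification are all hypothetical:

```python
def domain_filter(entry: str) -> bool:
    # Cheap rules: drop entries matching known non-risk patterns (hypothetical list).
    noise = ("how do i", "pricing", "feature request")
    return not any(p in entry.lower() for p in noise)

def statistical_filter(entry: str, seen: dict[str, int], cap: int = 3) -> bool:
    # Exact repeats beyond a cap add no new signal.
    seen[entry] = seen.get(entry, 0) + 1
    return seen[entry] <= cap

def llm_verify(entry: str) -> bool:
    # Most expensive stage, reached last; stubbed with a keyword check.
    return any(w in entry.lower() for w in ("error", "fail", "down", "timeout"))

def is_risk_signal(entry: str, seen: dict[str, int]) -> bool:
    # Short-circuit cascade: the LLM only sees entries that pass the cheap stages.
    return (domain_filter(entry)
            and statistical_filter(entry, seen)
            and llm_verify(entry))

seen: dict[str, int] = {}
is_risk_signal("Checkout service is down", seen)     # passes all three stages
is_risk_signal("How do I reset my password?", seen)  # rejected by the cheap rules
```

Because Python's `and` short-circuits, a rejection at any stage skips every later one, which is exactly how a cascade keeps per-entry LLM cost low at 2,000+ entries per minute.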


Section 06

Limitations and Future Optimization Directions

Current limitations: a long cold-start cycle, insufficient multi-language support, limited root-cause localization, and a lack of predictive capability. Future directions: shorten the cold start, add multi-language support, integrate root-cause analysis, and implement predictive alerting.