Section 01
Introduction: STAR Framework—Enabling Self-Correction in AI for Microservice Fault Diagnosis
In complex microservice architectures, manual root cause analysis (RCA) is slow and labor-intensive, while LLM-driven diagnostic agents often fail because of a single-point error somewhere in their reasoning chain. The STAR framework improves the reliability and debuggability of such agents through four mechanisms: a four-stage workflow decomposition (evidence package, hypothesis set, analysis structure, decision report), fast/slow routing to allocate resources, counterfactual evaluation to locate the faulty stage, and stage-specific repair (patching and replaying). Experiments show that it outperforms baselines in root cause localization and fault classification, and that most errors can be corrected within one or two rounds of repair.
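To make the locate-then-repair idea concrete, here is a minimal Python sketch. It is not the STAR implementation: only the four stage names come from the text, while the functions `run_pipeline` and `locate_faulty_stage`, the toy stages, the `PRIOR` scores, and the reference artifacts are hypothetical. It also assumes that counterfactual evaluation means swapping one stage's output for a known-good artifact and replaying the downstream stages.

```python
from typing import Callable, Dict, Optional

# Stage names follow the text; everything else here is an illustrative assumption.
STAGES = ["evidence_package", "hypothesis_set", "analysis_structure", "decision_report"]
Stage = Callable[[object], object]


def run_pipeline(stages: Dict[str, Stage], raw: object,
                 patches: Optional[Dict[str, object]] = None) -> Dict[str, object]:
    """Run the stages in order; a patch replaces that stage's output before replay."""
    patches = patches or {}
    artifacts: Dict[str, object] = {}
    current = raw
    for name in STAGES:
        current = patches[name] if name in patches else stages[name](current)
        artifacts[name] = current
    return artifacts


def locate_faulty_stage(stages: Dict[str, Stage], raw: object,
                        reference: Dict[str, object],
                        is_correct: Callable[[object], bool]) -> Optional[str]:
    """Return the earliest stage whose patched output yields a correct final report."""
    for name in STAGES:
        replay = run_pipeline(stages, raw, {name: reference[name]})
        if is_correct(replay["decision_report"]):
            return name
    return None


# Toy pipeline: the true root cause is "db", but the hypothesis stage drops it.
PRIOR = {"db": 0.8, "checkout": 0.5, "payments": 0.2}  # hypothetical ranking scores
stages: Dict[str, Stage] = {
    "evidence_package": lambda raw: {s: r for s, r in raw.items() if r > 0.1},
    "hypothesis_set": lambda ev: [s for s in ev if s != "db"],  # injected bug
    "analysis_structure": lambda hyps: sorted(hyps, key=PRIOR.get, reverse=True),
    "decision_report": lambda ranked: ranked[0] if ranked else None,
}
raw_signals = {"checkout": 0.9, "payments": 0.4, "db": 0.7}
reference = {  # known-good artifacts, assumed available for the evaluation
    "evidence_package": {"checkout": 0.9, "db": 0.7},
    "hypothesis_set": ["db", "checkout"],
    "analysis_structure": ["db", "checkout"],
    "decision_report": "db",
}


def is_correct(report: object) -> bool:
    return report == "db"


faulty = locate_faulty_stage(stages, raw_signals, reference, is_correct)

# Stage-specific repair: fix only the faulty stage, then replay the whole pipeline.
if faulty is not None:
    stages[faulty] = lambda ev: list(ev)
repaired = run_pipeline(stages, raw_signals)["decision_report"]
```

In this toy run the unpatched pipeline reports `checkout`, patching the evidence package does not help, and patching the hypothesis set does, so the fault is attributed to the hypothesis stage and only that stage is repaired. Scanning stages from earliest to latest matters in this sketch: patching the final report always "fixes" the answer, so it is only blamed when no earlier patch does.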