Zing Forum


InSeeDent: An AI Root Cause Analysis Platform Based on Multi-Agent Workflow

InSeeDent is an AI-driven fault-analysis platform for DevOps and SRE teams. It automates root cause analysis (RCA) and real-time diagnosis of production faults through multi-agent workflows, RAG technology, and fusion of multi-source telemetry data.

Tags: AIOps · DevOps · SRE · Root Cause Analysis · Multi-Agent · LangGraph · RAG · Fault Diagnosis · Observability · Microservices
Published 2026-05-16 16:15 · Recent activity 2026-05-16 16:18 · Estimated read 8 min

Section 01

InSeeDent: Guide to the AI Root Cause Analysis Platform

InSeeDent is an open-source, AI-driven fault-analysis platform for DevOps and SRE teams. Its core functions are automated root cause analysis (RCA) and real-time diagnosis of production faults through multi-agent workflows, RAG technology, and fusion of multi-source telemetry data. Its design philosophy emphasizes "offline-first": it supports flexible LLM integration strategies (mock, local Ollama, OpenAI API) and provides a dark-themed operations dashboard as well as interactive chat-based investigation, aiming to solve the pain points of fault troubleshooting in production environments.


Section 02

Pain Points and Background of Fault Diagnosis in Production Environments

In modern microservice architectures, fault troubleshooting requires manually correlating multiple data sources such as logs, metrics, and traces; this can take hours or even days and seriously inflates MTTR (Mean Time to Repair). Traditional monitoring tools can raise alerts but lack intelligent analysis capabilities, while existing AIOps solutions tend to be expensive, complex to deploy, and strongly dependent on cloud APIs. The industry urgently needs a lightweight, offline-capable intelligent root cause analysis solution.


Section 03

InSeeDent Project Overview and System Architecture

Project Overview: Designed for DevOps/SRE teams, InSeeDent collects signals from observability tools (logs, metrics, traces, etc.), runs LangGraph multi-agent RCA workflows, displays analysis results on an operations dashboard, and supports chat-based investigation. It prioritizes offline use by default, using a rule-based simulation engine to generate root cause results, making it suitable for demonstration and offline scenarios.

System Architecture:

  • Frontend Layer: React + Tailwind + TypeScript, including modules such as the fault dashboard, telemetry ingestion, timeline, investigation chat, and service dependency graph.
  • Backend Layer: Java 17 + Spring Boot 3.2, providing RESTful APIs with JWT authentication, JPA persistence, and Flyway migrations; compatible with both H2 and PostgreSQL.
  • AI Service Layer: Python 3.11 + FastAPI + LangGraph, including agents for log analysis, metrics analysis, trace analysis, deployment correlation, correlation analysis, and summarization.
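
To make the layering concrete, here is a minimal sketch of the data shapes the backend might exchange with the AI service layer over REST. The field names (`incident_id`, `signals`, `remediation`, etc.) are hypothetical illustrations, not the project's actual API schema.

```python
from typing import List, TypedDict

# Hypothetical request/response shapes for an RCA endpoint on the
# AI service layer; names are illustrative, not InSeeDent's real schema.

class TelemetrySignal(TypedDict):
    source: str      # "logs" | "metrics" | "traces"
    service: str     # emitting microservice
    payload: str     # raw signal content

class RcaRequest(TypedDict):
    incident_id: str
    signals: List[TelemetrySignal]

class RcaResponse(TypedDict):
    incident_id: str
    root_cause: str
    confidence: float        # 0.0 - 1.0
    remediation: List[str]   # suggested repair steps

def build_request(incident_id: str, signals: List[TelemetrySignal]) -> RcaRequest:
    """Assemble the JSON body the backend would POST to the AI service."""
    return {"incident_id": incident_id, "signals": signals}
```

Typed request/response shapes like these keep the Java backend and the Python AI service honest about the same contract, whichever serialization layer sits between them.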

Section 04

Detailed Explanation of Multi-Agent Workflow and RAG Features

Multi-Agent Workflow: When RCA is triggered, the agents execute in sequence: log_agent → metrics_agent → trace_agent → deployment_agent → correlation_agent → summarizer_agent. Responsibilities of each agent:

  • Log agent: extracts anomalies from log streams.
  • Metrics agent: analyzes time-series data.
  • Trace agent: locates faults in the call chain.
  • Deployment agent: correlates faults with recent changes.
  • Correlation agent: constructs fault propagation paths.
  • Summarizer agent: generates RCA reports with confidence levels and repair suggestions.
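
The sequential flow above can be approximated in plain Python. The real project wires these steps as a LangGraph graph; the agent bodies below are placeholder heuristics (toy thresholds, invented keys like "spans" and "deploys"), not InSeeDent's implementation.

```python
from typing import Callable, Dict, List

State = Dict[str, object]  # shared state passed agent to agent

def log_agent(state: State) -> State:
    # Toy heuristic: treat lines containing "ERROR" as anomalies.
    state["log_anomalies"] = [l for l in state.get("logs", []) if "ERROR" in l]
    return state

def metrics_agent(state: State) -> State:
    # Flag metrics whose latest normalized value exceeds a toy threshold.
    metrics = state.get("metrics", {})
    state["metric_anomalies"] = [k for k, v in metrics.items() if v > 0.9]
    return state

def trace_agent(state: State) -> State:
    # Mark call-chain spans slower than 500 ms.
    state["slow_spans"] = [s for s in state.get("spans", []) if s.get("ms", 0) > 500]
    return state

def deployment_agent(state: State) -> State:
    # Keep only the most recent deployment as a correlation candidate.
    state["recent_deploys"] = state.get("deploys", [])[-1:]
    return state

def correlation_agent(state: State) -> State:
    # Build a crude propagation path from what earlier agents found.
    path = [f"deploy:{d}" for d in state["recent_deploys"]]
    path += [f"metric:{m}" for m in state["metric_anomalies"]]
    path += [f"log:{l}" for l in state["log_anomalies"]]
    state["propagation_path"] = path
    return state

def summarizer_agent(state: State) -> State:
    path = state["propagation_path"]
    state["report"] = {
        "root_cause": path[0] if path else "unknown",
        "confidence": 0.8 if path else 0.1,
    }
    return state

PIPELINE: List[Callable[[State], State]] = [
    log_agent, metrics_agent, trace_agent,
    deployment_agent, correlation_agent, summarizer_agent,
]

def run_rca(state: State) -> State:
    for agent in PIPELINE:
        state = agent(state)
    return state
```

Each agent reads and enriches a shared state dict, which is essentially what a LangGraph `StateGraph` formalizes: nodes are agent functions, edges fix the execution order.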

RAG Features: InSeeDent integrates Retrieval-Augmented Generation (RAG), vectorizing and storing historical fault cases and operations manuals. When a new fault occurs, it automatically matches similar historical cases, which is useful for recurring faults, onboarding new team members, and accumulating an organizational knowledge base.
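
The case-matching step can be sketched as follows. A real RAG pipeline uses an embedding model and a vector store; here a bag-of-words cosine similarity stands in for embeddings, and the knowledge-base entries are invented examples.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: bag-of-words term counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def match_cases(new_fault: str, kb: Dict[str, str], top_k: int = 2) -> List[Tuple[str, float]]:
    """Rank historical cases by similarity to the new fault description."""
    q = embed(new_fault)
    scored = [(case_id, cosine(q, embed(text))) for case_id, text in kb.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]

# Invented historical cases standing in for a vectorized fault knowledge base.
kb = {
    "case-101": "database connection pool exhausted timeout errors",
    "case-102": "disk full on logging node",
    "case-103": "connection timeout after deploy of payment service",
}
top = match_cases("payment service timeout errors after deploy", kb)
```

Swapping `embed` for a sentence-embedding model and the dict for a vector store gives the production shape of the same retrieval step.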


Section 05

Flexible LLM Integration Strategies and Deployment Methods

LLM Integration Modes:

Mode            Description                                         Network Dependency
mock (default)  Deterministic multi-agent logic + heuristic rules   None
ollama          Local offline LLM (e.g., llama3.2, mistral)         None after model download
openai          OpenAI API (e.g., gpt-4o-mini)                      Required
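
A sketch of how such mode switching could be wired as a simple provider registry. The environment variable name (`LLM_MODE`) and function signatures are assumptions for illustration, not InSeeDent's actual configuration keys; the ollama/openai providers are stubbed to keep the example offline and self-contained.

```python
import os
from typing import Callable, Dict, Optional

def mock_complete(prompt: str) -> str:
    # Deterministic stub: no network, suitable for demos and tests.
    return f"[mock RCA for: {prompt[:40]}]"

def ollama_complete(prompt: str) -> str:
    # Would POST to a local Ollama server (default http://localhost:11434);
    # stubbed here so the sketch runs without one.
    raise NotImplementedError("requires a running Ollama instance")

def openai_complete(prompt: str) -> str:
    # Would call the OpenAI API (e.g., gpt-4o-mini); needs network + API key.
    raise NotImplementedError("requires an OpenAI API key")

PROVIDERS: Dict[str, Callable[[str], str]] = {
    "mock": mock_complete,
    "ollama": ollama_complete,
    "openai": openai_complete,
}

def get_llm(mode: Optional[str] = None) -> Callable[[str], str]:
    """Pick a completion function; unknown or unset modes fall back to mock."""
    mode = mode or os.environ.get("LLM_MODE", "mock")
    return PROVIDERS.get(mode, mock_complete)
```

Falling back to `mock` on an unknown mode matches the document's offline-first philosophy: a misconfigured deployment degrades to the deterministic engine instead of failing.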

Deployment Methods:

  • Local Development: Start the backend (mvn spring-boot:run), AI service (run.sh), and frontend (npm run dev), then access http://localhost:5173. The default account is admin/admin123.
  • Docker Compose: Copy .env.example to .env, then execute docker compose up --build to start the full-stack service (including PostgreSQL and optional Kafka).

Section 06

Practical Value and Future Outlook

Practical Value: Reduces MTTR (from hours to minutes), lowers the manual analysis burden on engineers, accumulates fault handling knowledge, supports offline scenarios, and has an extensible architecture (integrates with existing toolchains).

Summary and Outlook: InSeeDent represents a practical direction in the AIOps field, combining LLM reasoning with engineering practice, and is well suited to small and medium-sized teams and data-sensitive enterprises. Planned enterprise-level features include integration with additional data sources, predictive alerting, and deeper LLM integration; it is worth DevOps teams' attention and evaluation.