name: bian-que-agentic-operations
description: "Agentic framework for online system operations with Flexible Skill Arrangement. Use when designing LLM-based agents for system O&M, IT operations, DevOps automation, alert management, root cause analysis, or self-evolving agent systems. Activation triggers: system operations, O&M automation, agentic operations, skill arrangement, alert root cause analysis, self-evolving agents, release interception, proactive inspection, KuaiShou operations."
Bian Que: Agentic Framework for Online System Operations
An agentic framework for operating large-scale online systems, built on three contributions: a unified operational paradigm, Flexible Skill Arrangement, and a unified self-evolving mechanism. Deployed at KuaiShou, with a reported 75% alert reduction, 80% root cause analysis (RCA) accuracy, and a 50%+ MTTR reduction.
Metadata
- Source: arXiv:2604.26805
- Authors: Bochao Liu, Zhipeng Qian, Yang Zhao, et al.
- Published: 2026-04-29
- Code: Available at GitHub
Core Methodology
Key Innovation 1: Unified Operational Paradigm
Abstract day-to-day O&M into three canonical patterns:
- Release Interception - Monitor deployments, detect anomalies, gate releases before they reach production
- Proactive Inspection - Systematically check system health, predict issues before they manifest
- Alert Root Cause Analysis - Diagnose alert triggers, trace to root cause, recommend remediation
Why this works: Instead of treating each O&M task as unique, this taxonomy covers 90%+ of operational events with reusable patterns.
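As a rough illustration of the taxonomy, the three patterns can be modeled as an enumeration with a lookup from raw event types. This is a minimal sketch: the event-type strings and the `classify_event` helper are hypothetical and not taken from the Bian Que implementation.

```python
from enum import Enum
from typing import Optional


class Pattern(Enum):
    """The three canonical O&M patterns named in the text."""
    RELEASE_INTERCEPTION = "release_interception"
    PROACTIVE_INSPECTION = "proactive_inspection"
    ALERT_RCA = "alert_root_cause_analysis"


# Hypothetical event-type names; a real deployment would use its own catalog.
EVENT_TYPE_TO_PATTERN = {
    "deployment_started": Pattern.RELEASE_INTERCEPTION,
    "scheduled_health_check": Pattern.PROACTIVE_INSPECTION,
    "alert_fired": Pattern.ALERT_RCA,
}


def classify_event(event_type: str) -> Optional[Pattern]:
    """Return the canonical pattern for an event type, or None if uncataloged."""
    return EVENT_TYPE_TO_PATTERN.get(event_type)
```

Returning `None` for uncataloged events keeps the gaps in the taxonomy visible instead of forcing every event into one of the three patterns.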
Key Innovation 2: Flexible Skill Arrangement
Each Skill specifies:
- Which data to retrieve: metrics, logs, change events, traces
- Which knowledge to apply: handbook rules, practitioner experience, historical patterns
- Context binding: skills are scoped to specific business-module contexts
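The three parts of a Skill listed above could be captured in a small structured object. The sketch below is an assumption about shape only: the field names, the `DATA_KINDS` and `KNOWLEDGE_KINDS` vocabularies, and `validate_skill` are illustrative and do not come from the source.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Vocabularies taken from the bullet list above; the identifiers are assumed.
DATA_KINDS = {"metrics", "logs", "change_events", "traces"}
KNOWLEDGE_KINDS = {"handbook_rules", "practitioner_experience", "historical_patterns"}


@dataclass(frozen=True)
class Skill:
    name: str
    context: str                 # business-module context the skill is bound to
    data: Tuple[str, ...]        # which data to retrieve
    knowledge: Tuple[str, ...]   # which knowledge to apply


def validate_skill(skill: Skill) -> List[str]:
    """Return a list of problems; an empty list means the skill is well-formed."""
    problems = []
    if not skill.context:
        problems.append("missing context binding")
    for kind in skill.data:
        if kind not in DATA_KINDS:
            problems.append(f"unknown data kind: {kind}")
    for kind in skill.knowledge:
        if kind not in KNOWLEDGE_KINDS:
            problems.append(f"unknown knowledge kind: {kind}")
    return problems
```

For example, `Skill("payment-latency-rca", "payments", ("metrics", "traces"), ("handbook_rules",))` passes validation, while a skill with an empty `context` is reported as unbound.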
Skill lifecycle:
- Skills can be auto-generated by LLMs from operational requirements
- Skills can be updated via natural-language instructions from on-call engineers
- Skills compose - multiple skills can be arranged for complex operational scenarios
Critical insight: The deployment bottleneck for LLM-based O&M agents is not reasoning capability but orchestration: selecting the relevant data and knowledge for each event. Feeding every signal to the model dilutes the context and induces hallucination, while curating inputs by hand for each event is intractable.
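One way to picture the orchestration step is a filter that passes the LLM only the signals a skill declares. The function below is a hypothetical sketch; `scope_signals` and the signal names are assumptions, not part of the published framework.

```python
from typing import Any, Dict, Iterable, List, Tuple


def scope_signals(
    required: Iterable[str], available: Dict[str, Any]
) -> Tuple[Dict[str, Any], List[str]]:
    """Keep only the signals a skill asks for and report the ones that are absent."""
    required = list(required)
    selected = {kind: available[kind] for kind in required if kind in available}
    missing = [kind for kind in required if kind not in available]
    return selected, missing


# Everything the platform could supply for this event (illustrative values).
available = {
    "metrics": [0.91, 0.97],
    "logs": ["timeout calling ledger"],
    "traces": [],
    "change_events": ["deploy ledger v2"],
}

# The skill asks for two available signals and one the platform lacks.
selected, missing = scope_signals(["metrics", "change_events", "profiles"], available)
```

Here `selected` holds only `metrics` and `change_events`, so logs and traces never reach the prompt, and `missing` reports `profiles` so the gap can be surfaced rather than silently ignored.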
Key Innovation 3: Unified Self-Evolving Mechanism
One correction signal drives two parallel pathways:
- Case-Memory-to-Knowledge Distillation: Historical incident cases are distilled into structured operational knowledge
- Targeted Skill Refinement: Specific skills are refined based on correction signals
Architecture:
Operational Event → Skill Selection → Data/Knowledge Retrieval → LLM Reasoning → Action
                                           ↓
                                   Correction Signal
                                           ↓
           ┌───────────────────────────────┴───────────────────────────────┐
           ↓                                                               ↓
Case-Memory-to-Knowledge                                       Targeted Skill Refinement
 Distillation Pipeline                                         (update skill definition)
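The fan-out from one correction signal to both pathways might be sketched as follows. All names here (`Correction`, `SelfEvolver`, `on_correction`) are hypothetical, and the sketch only records the signal on each pathway; the actual distillation and refinement logic is not described in enough detail to reproduce.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Correction:
    event_id: str
    skill_name: str
    agent_conclusion: str
    corrected_conclusion: str


@dataclass
class SelfEvolver:
    # Pathway 1 input: cases awaiting distillation into knowledge.
    case_memory: List[Correction] = field(default_factory=list)
    # Pathway 2 input: corrections grouped by the skill they target.
    refinement_queue: Dict[str, List[Correction]] = field(default_factory=dict)

    def on_correction(self, correction: Correction) -> None:
        """Route a single correction signal to both pathways."""
        self.case_memory.append(correction)
        self.refinement_queue.setdefault(correction.skill_name, []).append(correction)
```

Because both pathways consume the same `Correction` object, an on-call engineer supplies feedback once and neither pathway needs its own feedback channel.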
Implementation Guide
Prerequisites
- LLM with tool-use capabilities
- Access to system metrics, logs, change events
- Knowledge base of operational procedures
Step-by-Step
Define Operational Taxonomy
- Catalog all O&M events into the three canonical patterns
- For each pattern, identify common sub-scenarios
Design Skills
- For each sub-scenario, create a skill specifying:
  - Required data sources (metrics, logs, traces, change events)
  - Required knowledge sources (runbooks, historical cases, expert rules)
  - Output format (diagnosis, remediation plan, escalation decision)
Implement Skill Registry
- Skills stored as structured objects (JSON/YAML)
- Skills indexed by business-module context and event type
- Support for skill versioning and A/B testing
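A registry with the indexing and versioning described above could look roughly like this. It is an in-memory sketch under assumed names (`SkillRegistry`, `register`, `get`); persistence and A/B traffic splitting are left out.

```python
from typing import Dict, List, Optional, Tuple


class SkillRegistry:
    """Skills indexed by (business-module context, event type), with version history."""

    def __init__(self) -> None:
        self._versions: Dict[Tuple[str, str], List[dict]] = {}

    def register(self, context: str, event_type: str, definition: dict) -> int:
        """Store a new version of a skill definition and return its 1-based version."""
        versions = self._versions.setdefault((context, event_type), [])
        versions.append(dict(definition))  # copy so later caller edits do not leak in
        return len(versions)

    def get(
        self, context: str, event_type: str, version: Optional[int] = None
    ) -> Optional[dict]:
        """Return the requested version, the latest if none is given, or None."""
        versions = self._versions.get((context, event_type), [])
        if not versions:
            return None
        if version is None:
            return versions[-1]
        if 1 <= version <= len(versions):
            return versions[version - 1]
        return None
```

Keeping every version, rather than overwriting in place, is what makes rollback and side-by-side comparison of two skill versions possible.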
Build Self-Evolution Loop
- Capture correction signals (human feedback, automated validation)
- Store incident cases with resolution outcomes
- Periodically distill cases into knowledge updates
- Refine skills based on performance metrics
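To make "refine skills based on performance metrics" concrete, one simple option is to flag skills whose correction rate exceeds a threshold. The metric, the `threshold` and `min_cases` defaults, and the function name are all assumptions chosen for illustration; the source does not specify which metrics are used.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def skills_needing_refinement(
    outcomes: Iterable[Tuple[str, bool]],
    threshold: float = 0.3,
    min_cases: int = 5,
) -> List[str]:
    """Flag skills whose share of corrected cases exceeds `threshold`.

    `outcomes` yields (skill_name, was_corrected) pairs, one per handled event.
    Skills with fewer than `min_cases` events are skipped as too noisy to judge.
    """
    totals: Counter = Counter()
    corrected: Counter = Counter()
    for skill_name, was_corrected in outcomes:
        totals[skill_name] += 1
        if was_corrected:
            corrected[skill_name] += 1
    return sorted(
        name
        for name, count in totals.items()
        if count >= min_cases and corrected[name] / count > threshold
    )
```

The `min_cases` guard reflects the cold-start concern: a skill that has handled only a couple of events should not be rewritten on the strength of one or two corrections.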
Pitfalls
- Signal dilution: Feeding all operational data to the LLM causes hallucination. Skills must precisely scope data retrieval.
- Cold start: Initial skill generation may require human-in-the-loop. Start with high-value scenarios first.
- Knowledge staleness: Skills must be regularly validated against current system state.
- Over-generalization: Skills scoped too broadly lose precision; skills scoped too narrowly become unmaintainable.
Applications
- IT operations automation (AIOps)
- Cloud infrastructure management
- E-commerce platform operations
- Microservice incident management
- DevOps workflow automation
Related Skills
- agentic-behavioral-modeling
- agentic-fast-slow-planning
- logact-agentic-reliability
- emergent-systems-design