name: mape-k-troubleshoot
description: >
  Diagnoses and fixes issues in the MAPE-K self-healing loop of x0tta6bl4.
  Use when user says "self-healing broken", "MAPE-K not working",
  "node not recovering", "healing loop stuck", "MTTR too high",
  "auto-recovery failed", or "diagnose self-healing".
metadata:
  author: x0tta6bl4
  version: 1.0.0
  category: operations
  tags: [mape-k, self-healing, troubleshooting, autonomic]
MAPE-K Self-Healing Troubleshooting
Overview
x0tta6bl4 uses the MAPE-K (Monitor-Analyze-Plan-Execute over Knowledge) autonomic computing loop for self-healing. Target MTTR is under 3 minutes.
Key files:
- src/core/mape_k_loop.py - Core MAPE-K implementation
- src/self_healing/mape_k.py - Self-healing integration
- src/self_healing/mape_k_integrated.py - Full integrated loop
- src/core/health.py - Health check providers
Instructions
Step 1: Identify the Stuck Phase
The MAPE-K loop has 4 phases. Determine which phase is failing:
Monitor phase issues:
- Symptoms: No metrics being collected, stale data
- Check: src/monitoring/metrics.py, src/monitoring/prometheus_client.py
- Verify Prometheus scraping is active on port 9090
Analyze phase issues:
- Symptoms: Anomalies not detected, false positives
- Check: src/ml/graphsage_anomaly_detector.py
- Verify anomaly threshold (default 0.6, adjustable)
- Check if model is in observe-only mode: src/ml/graphsage_observe_mode.py
Plan phase issues:
- Symptoms: Correct detection but no recovery plan generated
- Check: Planning logic in src/self_healing/mape_k_integrated.py
- Verify action policies are not too restrictive
Execute phase issues:
- Symptoms: Plan generated but not executed
- Check: Circuit breaker state (may be open after too many failures)
- Check: SPIFFE identity valid (execution requires authenticated context)
Step 2: Check Health Endpoints
# Overall health
curl -s http://localhost:8080/health
# Detailed status with MAPE-K state
curl -s http://localhost:8080/api/v1/mesh/status
# Prometheus metrics for MAPE-K
# (URL is quoted so the shell does not treat "?" as a glob character)
curl -s 'http://localhost:9090/api/v1/query?query=mape_k_cycle_duration_seconds'
# Programmatic health check (no server required)
from src.core.health import get_health_with_dependencies
import json
print(json.dumps(get_health_with_dependencies(), indent=2))
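The Prometheus query above can also be issued and parsed from Python. The following is a minimal sketch using only the standard library; the helper names `build_query_url`, `extract_samples`, and `query_prometheus` are hypothetical, and the sample payload values are made up to illustrate the shape of an instant-query response. Note that if `mape_k_cycle_duration_seconds` is exported as a histogram, the bare name returns no series, and the `_count`, `_sum`, or `_bucket` series should be queried instead.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build an instant-query URL with the PromQL expression URL-encoded."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def extract_samples(payload):
    """Return (labels, value) pairs from an instant-vector query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector result, got {data['resultType']}")
    # Each sample is {"metric": {...labels...}, "value": [timestamp, "value-as-string"]}.
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


def query_prometheus(base_url, promql, timeout=5.0):
    """Run the query against a live Prometheus server (not called in this sketch)."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return extract_samples(json.load(resp))


# Made-up payload showing the response shape; real values will differ.
sample_payload = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "mape_k_cycle_duration_seconds_count"},
                "value": [1700000000.0, "42"],
            }
        ],
    },
}

print(build_query_url("http://localhost:9090", "mape_k_cycle_duration_seconds_count"))
print(extract_samples(sample_payload))
```

An empty result list from a live query usually means the metric name is wrong or nothing has been scraped yet, which points back at the Monitor phase.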
Step 3: Review Logs
# Docker
docker-compose logs --tail=100 app | grep -i "mape\|heal\|anomal"
# Kubernetes
kubectl logs -n x0tta6bl4 deployment/proxy-api --tail=100 | grep -i "mape\|heal"
# Local
grep -i "mape\|heal\|anomal" /var/log/x0tta6bl4/app.log
Look for these patterns:
- MAPE-K cycle completed - Loop is running
- Anomaly detected - Analysis phase working
- Recovery plan generated - Planning phase working
- Executing recovery action - Execute phase working
- Circuit breaker OPEN - Executions halted (too many failures)
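Scanning for these patterns can be automated. The sketch below counts each marker in a set of log lines and reports the first phase whose marker never appears, which is a reasonable place to start looking. The function names and the sample log lines are hypothetical; only the marker strings come from the list above. A silent Analyze phase may simply mean no anomaly occurred in the window, so treat the result as a hint rather than a verdict.

```python
from collections import Counter

# Marker strings from the pattern list, keyed by what they indicate.
MARKERS = {
    "loop": "MAPE-K cycle completed",
    "analyze": "Anomaly detected",
    "plan": "Recovery plan generated",
    "execute": "Executing recovery action",
    "breaker_open": "Circuit breaker OPEN",
}

# Order in which a healthy loop is expected to produce evidence.
PHASE_ORDER = ["loop", "analyze", "plan", "execute"]


def count_markers(lines):
    """Count case-insensitive occurrences of each marker across the log lines."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for key, marker in MARKERS.items():
            if marker.lower() in lowered:
                counts[key] += 1
    return counts


def first_silent_phase(counts):
    """Return the first phase with no log evidence, or None if all phases logged."""
    for phase in PHASE_ORDER:
        if counts[phase] == 0:  # Counter returns 0 for missing keys
            return phase
    return None


# Made-up log lines; the real log format may differ.
sample_lines = [
    "2024-01-01 12:00:00 INFO MAPE-K cycle completed",
    "2024-01-01 12:00:05 WARNING Anomaly detected on node-3",
    "2024-01-01 12:00:06 INFO MAPE-K cycle completed",
]

counts = count_markers(sample_lines)
print(dict(counts))
print("first silent phase:", first_silent_phase(counts))
```

For a real log, pass an open file object instead of `sample_lines`, for example `count_markers(open("/var/log/x0tta6bl4/app.log"))`.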
Step 4: Iterative Fix
Based on the stuck phase, apply fixes:
Fix Monitor Phase
- Verify metrics collection:
from src.monitoring.metrics import MetricsRegistry
# MetricsRegistry exposes individual class-level Prometheus objects.
# Check key MAPE-K counters directly:
print(MetricsRegistry.mapek_cycles_total)         # Counter
print(MetricsRegistry.mapek_anomalies_detected)   # Counter
print(MetricsRegistry.mapek_cycle_duration)       # Histogram
print(MetricsRegistry.self_healing_mttr_seconds)  # Histogram
# Full list: inspect.getmembers(MetricsRegistry, lambda v: not callable(v))
- Check Prometheus target is up: http://localhost:9090/targets
- Verify health providers are registered in src/core/health.py
Fix Analyze Phase
- Check GraphSAGE model state:
from src.ml.graphsage_anomaly_detector import GraphSAGEAnomalyDetector
detector = GraphSAGEAnomalyDetector()
# If model is None, torch not available - falls back to rule-based
print(f"Model: {detector.model}, Threshold: {detector.anomaly_threshold}")
- Adjust threshold if too high (missing anomalies) or too low (false positives)
- Check if observe mode is stuck: disable with detector.observe_mode = False
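Before changing the threshold, it can help to see how candidate values trade missed anomalies against false positives on a small labeled sample. The sketch below is independent of the project's detector: `confusion_at` is a hypothetical helper, the scores and labels are invented, and it assumes a score greater than or equal to the threshold counts as an anomaly (the real detector's comparison may differ).

```python
def confusion_at(scores, labels, threshold):
    """Count outcomes when every score >= threshold is flagged as an anomaly."""
    result = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for score, is_anomaly in zip(scores, labels):
        flagged = score >= threshold
        if flagged and is_anomaly:
            result["tp"] += 1   # correctly detected anomaly
        elif flagged and not is_anomaly:
            result["fp"] += 1   # false positive
        elif not flagged and is_anomaly:
            result["fn"] += 1   # missed anomaly
        else:
            result["tn"] += 1   # correctly ignored
    return result


# Invented anomaly scores with ground-truth labels (True = real anomaly).
scores = [0.10, 0.35, 0.55, 0.62, 0.70, 0.90]
labels = [False, False, True, False, True, True]

for threshold in (0.5, 0.6, 0.7):
    print(threshold, confusion_at(scores, labels, threshold))
```

On this sample, lowering the threshold from 0.6 to 0.5 recovers the missed anomaly at 0.55 without adding false positives, while raising it to 0.7 removes the false positive at 0.62 but still misses the 0.55 case.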
Fix Plan Phase
- Verify recovery strategies are registered
- Check if action quotas are exhausted
- Verify SPIFFE identity for cross-node recovery
Fix Execute Phase
- Reset circuit breaker if stuck open:
from src.self_healing.recovery_actions import RecoveryActionExecutor
executor = RecoveryActionExecutor()
# Check circuit breaker state
print(executor.get_circuit_breaker_status())
# {'enabled': True, 'state': 'open'/'closed'/'half_open', 'failures': N, ...}
# Circuit breaker auto-resets after timeout; force reset by re-instantiating
- Inspect recovery action history:
history = executor.get_action_history(limit=20)  # List[RecoveryResult]
for r in history:
    print(f"{r.action_type.value}: success={r.success}, duration={r.duration_seconds:.2f}s")
print(executor.get_success_rate())  # Overall 0.0–1.0
- Check SPIFFE certificate expiry
- Verify target node is reachable
Step 5: Validate Fix
After applying a fix, verify that the loop completes:
# Watch for a full MAPE-K cycle
curl -s http://localhost:8080/api/v1/mesh/status | python3 -c "
import sys, json
data = json.load(sys.stdin)
print(f'MAPE-K status: {data.get(\"mape_k_status\", \"unknown\")}')
print(f'Last cycle: {data.get(\"last_mape_k_cycle\", \"never\")}')
"
Re-run if the issue persists. Check each phase sequentially until the full loop completes within the 3-minute MTTR target.
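Instead of re-running the check by hand, the status endpoint can be polled until `last_mape_k_cycle` changes. This is a minimal sketch: `fetch_mesh_status` and `wait_for_new_cycle` are hypothetical helpers, the 180-second default simply mirrors the 3-minute MTTR target, and the value of `last_mape_k_cycle` is only compared for inequality because its format is not specified here.

```python
import json
import time
import urllib.request


def fetch_mesh_status(url="http://localhost:8080/api/v1/mesh/status", timeout=5.0):
    """Fetch the mesh status JSON from a running service (not called in this sketch)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def wait_for_new_cycle(fetch_status, timeout=180.0, interval=5.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Return True once last_mape_k_cycle differs from its initial value.

    Returns False if no new cycle is observed before the timeout expires.
    """
    baseline = fetch_status().get("last_mape_k_cycle")
    deadline = clock() + timeout
    while clock() < deadline:
        sleep(interval)
        current = fetch_status().get("last_mape_k_cycle")
        if current is not None and current != baseline:
            return True
    return False


# Demonstration with canned responses instead of a live server.
canned = iter([
    {"mape_k_status": "running", "last_mape_k_cycle": "cycle-1"},
    {"mape_k_status": "running", "last_mape_k_cycle": "cycle-1"},
    {"mape_k_status": "running", "last_mape_k_cycle": "cycle-2"},
])
print(wait_for_new_cycle(lambda: next(canned), sleep=lambda seconds: None))
```

Against a live service, call `wait_for_new_cycle(fetch_mesh_status)`; a `False` result means no cycle completed within the window and the phase checks should be repeated.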
Common Issues
Circuit breaker stuck open
Cause: Too many consecutive recovery failures
Solution: Fix the underlying failure first, then wait for the circuit breaker half-open timeout or restart the service
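To make the closed, open, and half-open states concrete, here is a generic circuit breaker sketch. It is not the implementation in `RecoveryActionExecutor`; the class, its parameters (`failure_threshold`, `reset_timeout`), and the default values are assumptions chosen for illustration. It shows why fixing the underlying failure matters: a failed trial call in the half-open state sends the breaker straight back to open.

```python
import time


class CircuitBreaker:
    """Generic three-state circuit breaker (illustrative, not the project's code)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def allow(self):
        """Return True if a call may proceed; move open -> half_open after the timeout."""
        if self.state == "open":
            if self._clock() - self.opened_at >= self.reset_timeout:
                self.state = "half_open"  # permit a single trial call
                return True
            return False
        return True

    def record_success(self):
        """A successful call closes the breaker and clears the failure count."""
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        """Open the breaker on a failed trial call or once the threshold is reached."""
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self._clock()


# Demonstration with a controllable clock so no real waiting is needed.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])
breaker.record_failure()
breaker.record_failure()
print(breaker.state, breaker.allow())   # breaker is open, calls are rejected
now[0] = 10.0
print(breaker.allow(), breaker.state)   # timeout elapsed, one trial call allowed
breaker.record_success()
print(breaker.state)                    # trial succeeded, breaker closes
```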
GraphSAGE model not loading
Cause: torch-geometric not installed or GPU not available
Solution: Falls back to rule-based detection automatically. Install torch-geometric for ML-based detection: pip install torch-geometric
MAPE-K cycle too slow (MTTR > 3 min)
Cause: Analysis phase taking too long, or network latency
Solution: Reduce anomaly detection batch size, increase monitoring frequency
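To confirm which phase dominates the cycle time before tuning anything, each phase call can be wrapped in a timer. The sketch below uses a hypothetical `timed` context manager and `time.sleep` calls as stand-ins for the real phase functions, which are not shown here; in practice the wrappers would go around the corresponding calls in the loop implementation.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(phase, timings, clock=time.perf_counter):
    """Add the elapsed wall-clock time of the wrapped block to timings[phase]."""
    start = clock()
    try:
        yield
    finally:
        timings[phase] = timings.get(phase, 0.0) + (clock() - start)


timings = {}

with timed("analyze", timings):
    time.sleep(0.02)    # stand-in for the real analyze step

with timed("plan", timings):
    time.sleep(0.005)   # stand-in for the real plan step

slowest = max(timings, key=timings.get)
print(f"slowest phase: {slowest}")
for phase, seconds in timings.items():
    print(f"{phase}: {seconds:.3f}s")
```

If the Analyze phase is the slowest, reducing the detection batch size is the first thing to try; if every phase is fast, the delay is more likely network latency between nodes.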