acm-cluster-remediation - SKILL.md Agent Skill

name: acm-cluster-remediation
description: Remediate ACM hub cluster issues with structured approval workflow. Proposes fixes based on diagnosis findings, executes approved mutations, and verifies results. Use when asked to fix, remediate, repair, or resolve ACM cluster issues.
compatibility: "Requires oc CLI logged into an ACM hub with permissions to patch, scale, and restart resources. Uses acm-hub-health-check skill for verification. Uses acm-cluster-health skill for post-fix validation."
metadata:
  author: acm-qe
  version: "1.0.0"

ACM Cluster Remediation

Executes cluster mutations to fix diagnosed issues. Works with structured approval gates to ensure safety.

Standalone operation: If invoked directly without prior diagnosis findings in the conversation:

Inform the user: "I need diagnosis findings before proposing remediation. Ask me to 'check my hub health' first, or describe the specific issue you want to fix and I'll verify it before proposing a remediation plan."
If the user describes a specific issue: perform ONLY a targeted verification of that specific issue (not a full diagnostic), then propose a fix
Do not run a full diagnostic as a hidden prerequisite -- that is the acm-hub-health-check skill's job

When invoked after a full diagnosis (via acm-hub-health-check), this skill receives comprehensive findings and proposes targeted, evidence-based fixes.

Mandatory Protocol (cannot skip or reorder)

Step 1: Verify Diagnosis Exists

If diagnosis findings are available from acm-hub-health-check in the current conversation, use them. If the user described a specific issue (e.g., "search pods are crashlooping"), perform a targeted verification of that issue only:

oc get pods -n <mch-ns> | grep search
oc logs <failing-pod> -n <mch-ns> --tail=50

Do not run a broad health assessment. Verify the specific reported issue, then proceed to Step 2.
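A minimal sketch of the targeted check above, assuming <mch-ns> has already been resolved to a real namespace. The function names unhealthy_pods and verify_component are hypothetical. Filtering on the STATUS column is a simplification: a pod can report Running while its containers are not ready.

```shell
# Hypothetical helper: list pods of one component that are not Running/Completed.
# $1 = namespace, $2 = component name fragment (e.g. "search").
unhealthy_pods() {
  oc get pods -n "$1" --no-headers \
    | awk -v pat="$2" '$1 ~ pat && $3 != "Running" && $3 != "Completed" { print $1 }'
}

# Tail the last 50 log lines of each unhealthy pod for that component only.
verify_component() {
  for pod in $(unhealthy_pods "$1" "$2"); do
    echo "== $pod =="
    oc logs "$pod" -n "$1" --tail=50
  done
}
```

Usage: verify_component <mch-ns> search. Nothing is printed when every matching pod is Running or Completed.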

Step 2: Present Remediation Plan

Present a structured plan to the user. Use this exact format:

Remediation Plan
================

Based on [diagnosis / quick assessment], the following fixes are proposed:

Fix 1: [Title]
  Issue: [What's wrong]
  Action: [Exact command to run]
  Risk: [Low / Medium / High]
  Expected outcome: [What should happen after]

Fix 2: [Title]
  ...

Issues NOT fixable on-cluster:
  - [Issue that requires external action]

Should I proceed? (yes/no, or specify which fixes to apply)

Step 3: Get Explicit Approval

Do NOT proceed until the user explicitly approves. Accept:

"yes" -- execute all proposed fixes
"yes, fix 1 and 3 only" -- execute only specified fixes
"no" -- abort remediation entirely
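The three accepted replies can be mapped to a list of fix numbers. The sketch below is one way to do that; parse_approval is a hypothetical name, and treating every unrecognized reply as "nothing approved" is an assumed conservative default rather than something the protocol spells out.

```shell
# Hypothetical helper: map an approval reply to the fix numbers to execute.
# $1 = user reply, $2 = number of fixes in the plan.
# Prints space-separated fix numbers, or an empty line when nothing is approved.
parse_approval() {
  reply=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  out=""
  case "$reply" in
    yes)
      # Bare "yes": every proposed fix is approved.
      i=1
      while [ "$i" -le "$2" ]; do out="$out $i"; i=$((i + 1)); done
      ;;
    yes,*)
      # "yes, fix 1 and 3 only": keep the in-range numbers the user named.
      for n in $(printf '%s' "$reply" | tr -cs '0-9' ' '); do
        if [ "$n" -ge 1 ] && [ "$n" -le "$2" ]; then out="$out $n"; fi
      done
      ;;
    *)
      # "no" or any ambiguous reply: approve nothing.
      :
      ;;
  esac
  printf '%s\n' "${out# }"
}
```

Usage: parse_approval "yes, fix 1 and 3 only" 3 prints "1 3". An empty result means remediation must not start.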

Step 4: Execute Approved Fixes

For each approved fix, run the command and immediately verify:

Executing Fix 1: [Title]
  Command: oc rollout restart deploy/search-v2-operator -n <ns>
  Result: deployment.apps/search-v2-operator restarted
  Verification: oc get pods -n <ns> | grep search-v2
  Status: [OK / FAILED]
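A sketch of the execute-then-verify pattern for the restart case shown above. restart_and_verify is a hypothetical helper, and the 120s timeout is an assumed value. It uses oc rollout status as the verification instead of grepping pod names, so the result reflects whether the rollout actually completed.

```shell
# Hypothetical helper: restart one deployment, wait for the rollout, report status.
# $1 = namespace, $2 = deployment name. The 120s timeout is an assumption.
restart_and_verify() {
  echo "Command: oc rollout restart deploy/$2 -n $1"
  if oc rollout restart "deploy/$2" -n "$1" \
     && oc rollout status "deploy/$2" -n "$1" --timeout=120s; then
    echo "Status: OK"
  else
    echo "Status: FAILED"
    return 1
  fi
}
```

Usage: restart_and_verify <ns> search-v2-operator. A non-zero return means the fix must be reported as FAILED.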

Step 5: Post-Remediation Verification

After all fixes are executed, re-run Phase 1 (Discover) and Phase 3 (Check) of the diagnosis workflow on the affected components. Report a before/after comparison.
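One possible way to produce the before/after comparison is to snapshot deployment readiness before the first fix and again after the last one. This is a sketch only; snapshot_deployments and compare_snapshots are hypothetical names. Deployments are compared rather than pods because pod names change on every restart.

```shell
# Hypothetical helper: record "name ready/desired" for each deployment.
# $1 = namespace, $2 = output file.
snapshot_deployments() {
  oc get deploy -n "$1" --no-headers | awk '{ print $1, $2 }' | sort > "$2"
}

# Print changed lines: "<" = state before remediation, ">" = state after.
# Prints nothing when the two snapshots are identical.
compare_snapshots() {
  diff "$1" "$2" | grep '^[<>]' || true
}
```

Usage: run snapshot_deployments <ns> before.txt ahead of Step 4, snapshot_deployments <ns> after.txt once the fixes are done, then compare_snapshots before.txt after.txt.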

Allowed Mutations

These commands may be used for remediation (each prompts for user permission):

oc patch -- modify resource spec/status
oc scale -- adjust replica count
oc rollout restart -- restart a deployment
oc delete pod -- restart a pod (NOT deployment, NOT CRD)
oc annotate -- add/modify annotations
oc label -- add/modify labels
oc apply -- apply a manifest

Forbidden Operations (even with user approval)

oc delete on non-pod resources (CRDs, namespaces, deployments, PVCs, StatefulSets)
oc adm drain or oc adm cordon
oc create namespace
Anything that destroys data or removes infrastructure
Any mutation during Phases 1-6 of diagnosis
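The allowed and forbidden lists above can be approximated by a default-deny guard that runs before any command is proposed. classify_mutation is a hypothetical name. Pattern matching is only a first filter: it cannot tell whether an allowed verb (for example an oc apply or oc patch) would destroy data, so that judgment still has to be made per fix.

```shell
# Hypothetical guard: classify a proposed mutation command against the lists above.
# $1 = full command string. Prints "allowed" or "forbidden".
# Anything not explicitly recognized is forbidden (default deny).
classify_mutation() {
  case "$1" in
    # Pod deletion is the only permitted form of oc delete.
    "oc delete pod "*|"oc delete pods "*|"oc delete pod/"*) echo allowed ;;
    "oc delete "*|"oc adm drain"*|"oc adm cordon"*|"oc create namespace"*) echo forbidden ;;
    "oc patch "*|"oc scale "*|"oc rollout restart "*|"oc annotate "*|"oc label "*|"oc apply "*) echo allowed ;;
    *) echo forbidden ;;
  esac
}
```

Usage: classify_mutation "oc delete deployment search-api -n <ns>" prints "forbidden". Commands with flags placed before the resource type (such as oc delete -n <ns> pod ...) are also rejected, which errs on the safe side.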

Rules

NEVER execute mutations without a complete plan presented to the user
NEVER skip the approval step
NEVER execute fixes during diagnosis -- diagnosis MUST complete first
ALWAYS verify after each fix
ALWAYS run post-remediation validation
If a fix fails, report the failure and stop -- do not attempt the next fix without user acknowledgment
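The stop-on-failure rule can be sketched as a small runner. run_fixes is a hypothetical name; it assumes each argument is one approved fix as a single command string that already includes its own verification (for example a restart chained with oc rollout status).

```shell
# Hypothetical runner: execute approved fix commands in order, one per argument.
# Stops at the first failure and reports how many fixes were not attempted.
run_fixes() {
  n=0
  total=$#
  for cmd in "$@"; do
    n=$((n + 1))
    echo "Executing Fix $n: $cmd"
    if ! sh -c "$cmd"; then
      echo "Fix $n FAILED; stopping. Fixes not attempted: $((total - n))"
      return 1
    fi
    echo "Fix $n OK"
  done
}
```

A non-zero return is the point at which to report the failure and wait for user acknowledgment before continuing with the remaining fixes.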

Gotchas

Delete pod, not deployment -- oc delete pod removes a single pod, which its controller then recreates. oc delete deployment destroys the entire workload and its replica management. Always delete the pod to trigger a restart; never delete the deployment.
Wait after rollout restart -- oc rollout restart returns immediately, but pods typically take 30-60 seconds to cycle. Always run oc rollout status or poll pod readiness before declaring the fix successful.
ResourceQuota is an artifact, not a fix target -- If a ResourceQuota is blocking pod creation, it was externally applied. Deleting it may fix the symptom but violates the cluster admin's intent. Report it and let the user decide.
Scale to 0 then back is not the same as delete pod -- Scaling to 0 removes ALL replicas and may trigger dependent failures (leader election loss, webhook unavailability). Prefer oc rollout restart for deployment restarts.
Post-remediation verification must re-check dependents -- Fixing a root cause (e.g., search-postgres) does not instantly fix dependents (search-api, console). Verify the full dependency chain, not just the fixed component.
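A sketch of the dependency-chain check from the last gotcha, also covering the wait-after-restart point. verify_chain is a hypothetical helper, the 120s timeout is an assumed value, and it assumes each component in the chain is a Deployment with the name passed in.

```shell
# Hypothetical helper: wait for each deployment in a dependency chain, root cause first.
# $1 = namespace; remaining arguments = deployment names in dependency order.
# Checks every component even after a failure so the full chain is reported.
verify_chain() {
  ns=$1; shift
  failed=0
  for d in "$@"; do
    if oc rollout status "deploy/$d" -n "$ns" --timeout=120s >/dev/null 2>&1; then
      echo "$d: OK"
    else
      echo "$d: NOT READY"
      failed=1
    fi
  done
  return "$failed"
}
```

Usage: verify_chain <ns> search-postgres search-api. The function returns non-zero if any component in the chain did not become ready within the timeout.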