sysdig-runtime-investigate


Use this skill when investigating a runtime threat detected by Sysdig end-to-end. Surfaces the highest-priority threat, scores vulnerability vs runtime correlations on a 1-5 confidence scale, deep-dives into network blast radius or suspicious-binary VirusTotal lookups depending on the event class, and hands the case off to Jira or PagerDuty. Triggers on: "investigate runtime threat", "what is this Falco alert", "triage this SOC alert", "analyze runtime incident". Not for vulnerability prioritization (use `sysdig-investigate`) or remediation (use `sysdig-remediate`).

By sysdig · Updated 6/4/2026

name: sysdig-runtime-investigate
description: >
  Use this skill when investigating a runtime threat detected by Sysdig end-to-end. Surfaces the highest-priority threat, scores vulnerability vs runtime correlations on a 1-5 confidence scale, deep-dives into network blast radius or suspicious-binary VirusTotal lookups depending on the event class, and hands the case off to Jira or PagerDuty. Triggers on: "investigate runtime threat", "what is this Falco alert", "triage this SOC alert", "analyze runtime incident". Not for vulnerability prioritization (use sysdig-investigate) or remediation (use sysdig-remediate).
allowed-tools:
  - Read
  - Write
  - Glob
  - Grep
  - AskUserQuestion
  - WebFetch
  - Bash(env*)
  - Bash(curl ipinfo.io)
  - Bash(curl ip-api.com)
  - Bash(curl events.pagerduty.com)
  - mcp__secure-mcp-server__get_skill_state
  - mcp__secure-mcp-server__save_skill_state
  - mcp__secure-mcp-server__list_runtime_events
  - mcp__secure-mcp-server__count_runtime_events
  - mcp__secure-mcp-server__discover_runtime_event_field_values
  - mcp__secure-mcp-server__get_event_info
  - mcp__secure-mcp-server__get_event_process_tree
  - mcp__secure-mcp-server__list_threats_engine_groups
  - mcp__secure-mcp-server__get_threats_engine_group
  - mcp__secure-mcp-server__get_threats_engine_threat
  - mcp__secure-mcp-server__list_threats_engine_threats_by_group
  - mcp__secure-mcp-server__list_threats_engine_resources_by_group
  - mcp__secure-mcp-server__run_sysql
  - mcp__secure-mcp-server__list_vulnerability_findings_by_image
  - mcp__secure-mcp-server__list_runtime_scan_results
  - mcp__secure-mcp-server__get_scan_result
  - mcp__secure-mcp-server__fetch_threat_intelligence_feed

First-run notice (Public Beta)

Before doing any other work for this skill, perform this one-time check:

  1. If ~/.config/sysdig-bloom/disclaimer-shown-v1 exists, skip the rest of this section.

  2. Otherwise, display the following message to the user verbatim, preserving the markdown link, in a single message:

    This plugin is a Public Beta release. It is provided “as is” and “as available,” without warranties of any kind. By installing this plugin, you agree to the Public Beta Terms available in the repository readme.

  3. Create the marker file ~/.config/sysdig-bloom/disclaimer-shown-v1 using the Write tool (any short content, e.g. the current UTC timestamp). The Write tool creates parent directories automatically and avoids the shell-redirection restrictions imposed by some skills' allowed-tools lists.

  4. Then continue with the user's request.
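The one-time check above amounts to a marker-file guard. The following is a minimal sketch of that logic only, assuming a hypothetical `show` callback that displays the verbatim message; in the skill itself the marker is created with the Write tool, not with Python.

```python
from datetime import datetime, timezone
from pathlib import Path

# Marker path taken from the steps above.
MARKER = Path.home() / ".config" / "sysdig-bloom" / "disclaimer-shown-v1"


def first_run_notice(show, marker: Path = MARKER) -> bool:
    """Show the notice once; return True if it was displayed on this call.

    `show` is a hypothetical callback that prints the verbatim message.
    """
    if marker.exists():
        return False  # step 1: marker present, skip the section
    show()  # step 2: display the message in a single call
    # Step 3: create the marker with any short content (here, a UTC timestamp).
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.write_text(datetime.now(timezone.utc).isoformat())
    return True  # step 4: the caller continues with the user's request
```

Calling `first_run_notice` a second time with the same marker path returns `False` without invoking `show`.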

When you need to ask the user a question, get confirmation, or present choices, use the AskUserQuestion tool if available. This ensures proper rendering across all agent clients.

Input

Two invocation forms:

  • /sysdig-runtime-investigate — interactive. The skill surfaces the top-priority threats and asks you to pick one.
  • /sysdig-runtime-investigate <event_id> — directed. The skill investigates the given event/threat directly.

Principles

Glossary. This skill uses these terms with specific meanings — keep them straight in user-facing text and in the case body:

  • threat — a Threats Engine grouping (rule-level rollup of related events).
  • event — a single security signal: Falco syscall, CloudTrail entry, or k8s_audit record.
  • case — the report this skill produces and writes to /tmp.
  • incident — multiple correlated threats folded into the same case.
  • image / workload / host — compute targets at three levels: image artifact, Kubernetes workload (Pod, Deployment, etc.), and underlying VM or bare-metal host.

These are the rules of the game. Phases are a floor, not a ceiling. The chosen threat is a starting point — the goal is the full attack chain, which often spans multiple Threats Engine groups, multiple resources, and events that don't appear correlated at first.

  • Investigate freely. Phases are a floor; cross-cluster, cross-account, cross-threat-group correlation is expected when the signals support it.
    • Do: fold related threat groups into the same case; pivot to CloudTrail when you see IMDS theft; expand the time window when an AWS access-key creation lines up with a K8s threat.
    • Don't: stop at the first threat group; treat the trigger event as the whole story.
  • Keep the user informed. Drop a one-line status update between non-trivial calls — what you just found, what you're doing next.
    • Do: name the dimension you're pivoting to before the next call ("Process tree shows xmrig — looking for related cluster activity").
    • Don't: emit silent multi-call blocks; batch tool calls without narration.
  • Cite every claim. Every fact in the case body references its source — event ID, MCP (Model Context Protocol) tool name, REST path, or external URL.
    • Do: attach the source inline for every CVE, IOC (indicator of compromise), process, and timestamp.
    • Don't: paraphrase data without provenance; leave claims dangling.
  • Don't fabricate. If a CVE, IOC, or process didn't appear in the data, don't write it.
    • Do: omit fields the data didn't provide and record the limitation on the case object.
    • Don't: invent details to fill gaps; infer beyond what the data supports.
  • Two-tier output. The full case goes to a markdown file in /tmp. The user sees a 2-paragraph summary in chat plus the file path.
    • Do: keep tables, evidence dumps, and long traces in the file.
    • Don't: paste tables into chat; force the user to scroll for the verdict.
  • Read-only by design. This skill investigates and reports only. Remediation goes to sysdig-runtime-remediate (for live response actions on the running threat) or sysdig-remediate (for image-level CVE fixes).
    • Do: hand off to sysdig-runtime-remediate when the user wants to act on the live threat (kill, isolate, pause), or to sysdig-remediate when the fix is a vulnerable image.
    • Don't: run kubectl delete, terraform apply, or any destructive operation directly.
  • Always classify with MITRE. Assign one MITRE ATT&CK tactic to the trigger event — it drives the correlation gate and the watchlist mapping.
    • Do: pick the more specific tactic when two could apply; record the secondary on case.tactic_secondary.
    • Don't: skip classification; defer it to a later phase.
  • Always include event.id. Surface it in the file footer and any handoff payload — it's the audit trail.
    • Do: pass the event ID through Jira and PagerDuty payloads verbatim.
    • Don't: redact, abbreviate, or drop the event ID anywhere downstream.
  • Don't write cache or state to disk. The /tmp report file is the only persistent artifact.
    • Do: use MCP shared skill state for cross-skill context.
    • Don't: write working state, raw event payloads, or secrets to disk.

State

State is read and written via the Sysdig MCP server tools.

| Operation | Tool | Arguments |
| --- | --- | --- |
| Read state | get_skill_state | { "skill_state": "runtime-investigate" } |
| Write state | save_skill_state | { "skill_state": "runtime-investigate", "version": <n>, "data": { ... } } |
| Delete state | delete_skill_state | { "skill_state": "runtime-investigate" } |

A null response from get_skill_state means no state exists yet — start with { "version": 0 }.

Schema

{
  "version": 1,
  "last_run": "2026-05-04T18:30:00Z",
  "preferred_jira_project": "RUNTIME",
  "preferred_handoff": "jira",
  "recent_cases": [
    {
      "event_id": "abc123",
      "cluster": "prod-eu-1",
      "tactic": "c2",
      "case_file": "/tmp/sysdig-runtime-investigate-abc123-20260504-1830.md",
      "started": "2026-05-04T18:30:00Z",
      "completed": "2026-05-04T18:42:00Z",
      "handoff": { "destination": "jira", "ticket_key": "RUNTIME-1234", "ticket_url": "https://example.atlassian.net/browse/RUNTIME-1234" }
    }
  ]
}

Read/write rules

  • Get at the start of every session via get_skill_state with { "skill_state": "runtime-investigate" }. A null response means no state — start with { "version": 0 }.
  • Save at the end of every session via save_skill_state with { "skill_state": "runtime-investigate", "version": <n>, "data": { ... } }. Read the current contents first, merge new data, then pass the full merged object as data.
  • Version argument — the server uses version for optimistic concurrency. Pass it as a separate argument (do not include it inside data):
    • First write (get_skill_state returned null) → call with version: 0.
    • Subsequent writes → call with the same version value the previous get_skill_state returned.
    • On 409 conflict → call get_skill_state again, merge your changes into the freshly-read state, retry once with the new version.
  • Matching keys for upsert in recent_cases: match on event_id. Cap the list at the 10 most recent entries; drop the oldest when over.
  • Timestamps use ISO 8601. The case_file path is informational only — files in /tmp may be cleaned up by the OS.
  • preferred_handoff is "jira" | "pagerduty" | "both" | "skip" | null. null means the user hasn't expressed a preference.
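The read/merge/write rules above can be sketched as follows. This is an illustration under stated assumptions, not the server's API: `get_state` and `save_state` are hypothetical callables wrapping get_skill_state and save_skill_state, `VersionConflict` stands in for the 409 response, and "most recent" is assumed to mean the latest `started` timestamp.

```python
from typing import Callable, Optional, Tuple


class VersionConflict(Exception):
    """Hypothetical stand-in for the server's 409 conflict response."""


def upsert_recent_case(state: dict, case: dict, cap: int = 10) -> dict:
    """Return a copy of state with `case` upserted by event_id and the list capped."""
    others = [c for c in state.get("recent_cases", [])
              if c.get("event_id") != case["event_id"]]
    # Assumption: ISO 8601 UTC strings in one format sort chronologically as text.
    newest_first = sorted([case] + others,
                          key=lambda c: c.get("started", ""), reverse=True)
    merged = dict(state)
    merged["recent_cases"] = newest_first[:cap]  # drops the oldest when over cap
    return merged


def save_with_retry(
    get_state: Callable[[], Optional[Tuple[int, dict]]],
    save_state: Callable[[int, dict], None],
    case: dict,
) -> dict:
    """Read, merge, write; on a version conflict re-read and retry exactly once."""
    for attempt in range(2):
        current = get_state()
        # A null read means no state yet: first write uses version 0.
        version, data = current if current is not None else (0, {})
        merged = upsert_recent_case(data, case)
        try:
            save_state(version, merged)  # version is passed outside of data
            return merged
        except VersionConflict:
            if attempt == 1:
                raise  # second conflict: give up instead of looping
    raise AssertionError("unreachable")
```

Keeping `version` as a separate argument mirrors the rule that it must not be embedded in `data`.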

Steps

You run a 4-phase pipeline directly — no subagents.

Phase 0 ──→ Phase 1 ──→ Phase 2 ────→ Phase 3
Preflight   Surface     Investigate    Synthesise + report
                        (free-form)    (file + summary + handoff)

Phase 0 — Preflight

  1. Trust Preamble. Always present this before doing anything else. See references/trust-preamble.md for the full text. After presenting the preamble, proceed directly to Step 0a — do NOT ask for confirmation. The preamble is informational.

0a. MCP authentication preflight. Before any other step — including the state read below — run the preflight in references/auth-preflight.md and follow its instructions exactly. This skill requires the Sysdig MCP: it is the only source for Threats Engine, runtime events, process trees, SysQL, vulnerability scans, and the threat-intel feed. If the preflight returns State 2 (registered but not authenticated) or State 3 (not reachable), emit its verbatim message and stop — no data calls, no state read, no file writes. Do NOT call mcp__secure-mcp-server__authenticate autonomously. Only continue past this step on State 1 (catalog reachable).

0b. Read shared state. Call mcp__secure-mcp-server__get_skill_state with { "skill_state": "runtime-investigate" }. A null response means no prior state — start with { "version": 0 }. If state exists, hydrate preferred_jira_project, preferred_handoff, and recent_cases onto the working session so Phase 3 can default to the user's prior choices instead of re-asking.

  2. Reporting / CTI (cyber threat intelligence) probe (no-block). Detect available destinations and CTI tools dynamically. Do not require any specific env-var name — match by pattern.

    • Jira / case tracking: scan for MCP tools matching mcp__atlassian__*, mcp__*jira*, mcp__*linear*. Mark the first match.
    • Jira project key: scan env vars matching *JIRA_PROJECT* (e.g. SYSDIG_RUNTIME_JIRA_PROJECT). If matched, surface the value as the default project for Phase 3 handoff.
    • PagerDuty / on-call: scan env vars matching *PAGERDUTY*, *PD_TOKEN*, *PD_ROUTING_KEY*.
    • VirusTotal: scan env vars matching *VIRUSTOTAL*, *VT_API*, *VT_KEY*.

    Record what was found. Do not prompt yet if nothing was detected — defer that to Phase 3 handoff.

  3. Announce connectivity. Before moving on, surface a single user-facing line summarising what's wired up. Mark each integration (detected), (missing), or (probe failed). The Sysdig MCP is always here — Step 0a already guaranteed it. Example:

    • Connectivity: Sysdig MCP ✓ · Jira ✓ · PagerDuty — · VirusTotal —. Will skip binary VT enrichment and PagerDuty handoff.

    This is the only narration during Phase 0 — the per-step results above are recorded silently on the case object. Don't prompt; just inform.

  4. Entry-point detection. Parse the invocation argument:

    • No argument → interactive mode.
    • Anything else → directed mode, store the value as the event ID.
  5. Resume check (directed mode only). If recent_cases (from step 0b) has an entry whose event_id matches the current trigger, surface a one-line resume summary and ask the user via AskUserQuestion how to proceed:

    "You investigated <event_id> <N> hours ago (case file <case_file>, handoff <destination>:<key> if any). Continue with the saved case, refresh (re-run Phase 2), or start fresh (clear the prior entry)?"

    • Continue → skip Phase 1 + Phase 2; jump to Phase 3 with the saved case (read the file from disk if it still exists; otherwise treat as refresh).
    • Refresh → keep the saved entry but re-investigate; merge new findings.
    • Start fresh → drop the prior entry from recent_cases and re-run.

    Staleness default. If the prior completed timestamp is more than 4 hours ago, runtime data has likely shifted — pre-select Refresh as the default. Below 4 hours, default to Continue.

    Interactive mode does not pause here — Phase 1 will surface fresh threats and any prior case will only show up in the picker as (seen <N>h ago).
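The pattern-based env-var probe and the staleness default from the Phase 0 steps can be sketched as below. This is a hedged illustration: the function names and the returned dictionary shape are assumptions, and a `completed` timestamp of exactly 4 hours is treated as Continue because the text only specifies "more than" and "below".

```python
from datetime import datetime, timedelta, timezone
from fnmatch import fnmatchcase
from typing import Mapping, Optional

# Patterns copied from the probe step; the integration keys are illustrative.
ENV_PATTERNS = {
    "jira_project": ["*JIRA_PROJECT*"],
    "pagerduty": ["*PAGERDUTY*", "*PD_TOKEN*", "*PD_ROUTING_KEY*"],
    "virustotal": ["*VIRUSTOTAL*", "*VT_API*", "*VT_KEY*"],
}


def probe_env(env: Mapping[str, str]) -> dict:
    """Map each integration to the first matching env-var NAME, or None.

    Only names are recorded, never values, so secrets stay out of the case.
    """
    found: dict = {}
    for integration, patterns in ENV_PATTERNS.items():
        match: Optional[str] = next(
            (name for name in sorted(env)
             if any(fnmatchcase(name, p) for p in patterns)),
            None,
        )
        found[integration] = match
    return found


def resume_default(completed_iso: str, now: datetime,
                   stale_after_hours: int = 4) -> str:
    """Pre-select Refresh when the prior case completed more than 4 hours ago."""
    completed = datetime.fromisoformat(completed_iso.replace("Z", "+00:00"))
    if now - completed > timedelta(hours=stale_after_hours):
        return "Refresh"
    return "Continue"
```

In practice `probe_env(os.environ)` would feed the connectivity line, and `resume_default` would pick the pre-selected option for the resume question.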

Phase 1 — Surface

Interactive flow:

  1. List the top open threats — Threats Engine is exposed via MCP. Call mcp__secure-mcp-server__list_threats_engine_groups with limit=5, sorted by lastSignal:desc, status=open.

    If Threats Engine returns empty (unavailable or quiet in this tenant), fall back to runtime events for the same window: mcp__secure-mcp-server__list_runtime_events with last 24h, limit 10.

  2. Present the result as a markdown table:

    | # | Severity | Rule / aiGeneratedName | Resource | Last seen |

    Ask via AskUserQuestion: "Which one do you want to investigate?"

  3. Incident-scope detection at surface time. Before diving into the chosen threat, scan the other surfaced groups. Multi-stage attacks frequently span more than one Threats Engine grouping. Treat all groups sharing cluster + ±2h, OR aws.accountId + ±2h, OR same image as the same incident — one investigation, one case body, one narrative. Record them on case.incident_threat_groups.

    If these conditions hold, the chosen threat is one facet of a larger incident — fold all matching groups into the same case object. Tag them on case.incident_threat_groups (id, name, resource, last_seen, why_related). Phase 2's cluster-wide sweep then has a head start — these groups' constituent events should also appear in the sweep, but flagging them upfront lets the report's "Incident scope" section name them by their AI-generated title rather than as anonymous events.
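The incident-scope rule (same cluster within ±2h, or same AWS account within ±2h, or same image) can be expressed as a small predicate. This is a sketch under assumptions: the flat keys `cluster`, `aws_account_id`, `image`, and `last_seen` are hypothetical names for whatever the Threats Engine response actually carries.

```python
from datetime import datetime, timedelta
from typing import List, Optional

WINDOW = timedelta(hours=2)


def _parse(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing Z."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def why_related(a: dict, b: dict) -> Optional[str]:
    """Return the reason two threat groups belong to one incident, else None."""
    close = abs(_parse(a["last_seen"]) - _parse(b["last_seen"])) <= WINDOW
    if close and a.get("cluster") and a.get("cluster") == b.get("cluster"):
        return "same cluster within 2h"
    if close and a.get("aws_account_id") and \
            a.get("aws_account_id") == b.get("aws_account_id"):
        return "same AWS account within 2h"
    if a.get("image") and a.get("image") == b.get("image"):
        return "same image"  # image match needs no time window
    return None


def incident_threat_groups(chosen: dict, others: List[dict]) -> List[dict]:
    """Build the entries for case.incident_threat_groups."""
    tagged = []
    for group in others:
        reason = why_related(chosen, group)
        if reason:
            tagged.append({
                "id": group["id"],
                "name": group["name"],
                "resource": group.get("resource"),
                "last_seen": group["last_seen"],
                "why_related": reason,
            })
    return tagged
```

Recording `why_related` at tagging time keeps the justification available for the report's "Incident scope" section.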

Directed flow:

  1. Resolve the specific threat. The trigger ID may be a threat ID, a group ID, or an event ID — try in that order: mcp__secure-mcp-server__get_threats_engine_threat (single-threat detail); on 404 mcp__secure-mcp-server__get_threats_engine_group, then mcp__secure-mcp-server__list_threats_engine_threats_by_group with the ID; if all 404, fall back to mcp__secure-mcp-server__get_event_info with the event ID.
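The resolution order above is a fallback chain. A minimal sketch, assuming hypothetical callables that wrap the four MCP tools and raise a `NotFound` exception on a 404 (the real tool-call mechanics and response shapes are not shown in this document):

```python
from typing import Callable, Optional


class NotFound(Exception):
    """Hypothetical stand-in for a 404 from the MCP server."""


def resolve_trigger(
    trigger_id: str,
    get_threat: Callable[[str], dict],
    get_group: Callable[[str], dict],
    list_threats_by_group: Callable[[str], list],
    get_event: Callable[[str], dict],
) -> Optional[dict]:
    """Try threat ID, then group ID, then event ID; None if nothing matches."""
    try:
        return {"kind": "threat", "detail": get_threat(trigger_id)}
    except NotFound:
        pass
    try:
        group = get_group(trigger_id)
        threats = list_threats_by_group(trigger_id)
        return {"kind": "group", "detail": group, "threats": threats}
    except NotFound:
        pass
    try:
        return {"kind": "event", "detail": get_event(trigger_id)}
    except NotFound:
        return None  # caller reports "ID not found" and offers interactive mode
```

Returning `kind` alongside the detail lets later phases know which ID type the user supplied.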

Classification — MITRE ATT&CK tactic. From the rule name, rule source, and event labels, assign one MITRE tactic to the threat. Store it on the case as case.tactic. Phase 2 watchlist mapping reads this value. See references/mitre-tactics.md for the keyword-to-tactic table and the secondary-tactic rule.

Process tree (Sysdig MCP). If the threat has an event_id (the Threats Engine returns securityEvent references with IDs), call mcp__secure-mcp-server__get_event_process_tree with the event ID to retrieve the structured process tree. Store the parsed result (parent → child chain, command lines, sha256 if present) on case.process_tree.

Process evidence from aiGeneratedDescription (always runs — useful even alongside the structured tree). The description carries natural-language context the structured tree doesn't ("locale repeatedly", "curl --upload to external IP"). Parse it for process names (e.g. systemd, sshd, bash, curl, wget, nc, nslookup) and chain hints ("spawned by", "child of"). Store as case.process_evidence (list of strings).

The two are complementary: case.process_tree is structured ground truth, case.process_evidence is the AI's narrative read of the same chain. The report renders both in "What happened" — the tree as a tree, the evidence as a one-liner.
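Parsing aiGeneratedDescription for process names and chain hints can be as simple as a keyword scan. The sketch below assumes the example names and hints listed above are the whole vocabulary and that evidence strings use a hypothetical `process:` / `hint:` prefix; a real parser would likely need a broader list.

```python
import re
from typing import List

# Vocabulary taken from the examples in the text above.
KNOWN_PROCESSES = ("systemd", "sshd", "bash", "curl", "wget", "nc", "nslookup")
CHAIN_HINTS = ("spawned by", "child of")


def extract_process_evidence(description: str) -> List[str]:
    """Return process names and chain hints found in the description text."""
    text = description.lower()
    evidence = []
    for name in KNOWN_PROCESSES:
        # Word boundaries stop "nc" from matching inside words like "once".
        if re.search(rf"\b{re.escape(name)}\b", text):
            evidence.append(f"process:{name}")
    for hint in CHAIN_HINTS:
        if hint in text:
            evidence.append(f"hint:{hint}")
    return evidence
```

The result is a list of strings, matching the shape described for case.process_evidence.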

Store the threat, classification, secondary tactic (if any), process tree (if available), and process evidence on the working case object.

Phase 2 — Investigate (free-form, signal-driven)

Goal: reconstruct the full attack chain starting from the user's pick. Span multiple threat groups, multiple resources, multiple event sources if the signals lead there. The chain is the deliverable — phase boundaries from earlier versions of this skill (e.g. separate enrichment / classifier / synthesis stages) are explicitly not prescribed steps anymore.

Tell the user what you're doing as you go. Examples of good status updates:

  • "Process tree shows Tomcat → bash → xmrig — looks like miner persistence. Looking for related cluster activity."
  • "IMDS (Instance Metadata Service) theft on the host. Expanding to CloudTrail in the same AWS account."
  • "Found two more threat groups on the same account in the same hour — folding them in as the same campaign."
  • "Runtime scan for the workload came back empty — pulling image-level vuln findings instead."

Available signals (chase them when relevant)

These are the ingredients. The order is yours.

  • Tenant-wide critical sweep (sanity check) — once early in Phase 2, call mcp__secure-mcp-server__list_runtime_events with filter_expr = "severity in (0,1,2,3)" and no scope filter across the ±2h window around the trigger. Catches cross-domain signals the cluster/account-filtered queries miss (GitHub cloudProvider.account.id, Okta cloudProvider.tenantId, anything without K8s labels). Fold in any hit whose image-org, repo name, or actor matches the trigger.
  • Process tree of the trigger event — mcp__secure-mcp-server__get_event_process_tree. Almost always the highest-yield single artifact. Falls back to aiGeneratedDescription parsing if the MCP returns empty.
  • Prior events on the affected resource — last 7 days, via mcp__secure-mcp-server__list_runtime_events with a filter_expr matching the workload (kubernetes.cluster.name + namespace + workload) or host (host.hostName). For K8s workloads, also pull host-level events on the same node — escapes hide there.
  • Cluster-wide activity in a ±2h window around the trigger. Same MCP tool, three filters in parallel:
    • kubernetes.cluster.name = "<cluster>" and source = "syscall" (other resources in the cluster)
    • kubernetes.cluster.name = "<cluster>" and source = "k8s_audit" (Attach/Exec Pod, Deployment Created, etc.)
    • kubernetes.cluster.name = "<cluster>" and source = "cloudtrail" (cluster-tagged cloud events, if any)
  • Cloud-account-wide activity when the resource has aws.accountId / azure.subscriptionId / gcp.projectId. CloudTrail / agentless-aws-ml / agentless-okta-ml events live under the account dimension, not the cluster. This is the difference between catching multi-stage cross-cloud attacks (IMDS credential theft → IAM access-key creation → CloudTrail tampering → S3 exfiltration) and missing them. Filter: aws.accountId = "<account>" and source in ("cloudtrail", "agentless-aws-ml").
  • Other threat groups in this incident: fold them into the same case rather than investigating them separately. For groups already tagged in Phase 1, pull their constituent events via mcp__secure-mcp-server__list_threats_engine_threats_by_group and their affected resources via mcp__secure-mcp-server__list_threats_engine_resources_by_group (both keyed by the group ID), and merge them into the chain. If new groups appear in the cluster window during Phase 2, fold them in too. Cross-type is allowed: a CLOUD threat may be the same incident as a K8S_WORKLOAD threat.
  • Sibling resources / posture / RBAC (role-based access control) via mcp__secure-mcp-server__run_sysql. SysQL schema differs between tenants — adjust query shape if rejected. Example queries:
    • MATCH KubeWorkload AS wl WHERE wl.cluster = '<c>' RETURN wl.namespace, wl.name
    • MATCH Resource VIOLATES Control
    • MATCH KubeServiceAccount HAS KubeRoleBinding HAS KubeClusterRole
  • Vulnerability surface — start with mcp__secure-mcp-server__list_vulnerability_findings_by_image using the image digest from the threat detail (server-side filtering by severity / exploit / fix). For running/in-use package detail — which vulnerable packages are actually loaded on the workload or host — use mcp__secure-mcp-server__list_runtime_scan_results (filter by kubernetes.cluster.name/namespace/workload or host.hostName, sort runningVulnTotalBySeverity) and pass the returned resultId to mcp__secure-mcp-server__get_scan_result for the full packages + vulnerabilities breakdown.
  • External CTI for the top 5 critical/high CVEs that pass the MITRE-tactic gate (see references/correlation-guide.md): NVD, CISA KEV (Known Exploited Vulnerabilities), Exploit-DB, GHSA (GitHub Security Advisory) via WebFetch. Don't fetch CTI for tactic-mismatched CVEs.
  • Sysdig threat-intel feed via mcp__secure-mcp-server__fetch_threat_intelligence_feed — Sysdig-curated CVEs / zero-days / active-attack notes. Cross-reference any IOCs you collect.
  • VirusTotal (VT) for binary IOCs when a SHA256 surfaces on event fields and a VT key is present (Phase 0 records the env var). When the threat lacks proc.sha256, look across other events on the same container — drift detection events typically carry the hash.
  • GeoIP for network IOCs via curl https://ipinfo.io/<ip>/json (fallback ip-api.com).
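The filter expressions quoted in the list above can be generated instead of typed by hand, which keeps the three cluster sweeps and the account sweep consistent. This is a sketch: the helper names are hypothetical, and how the time window is passed to list_runtime_events is not specified here.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple


def cluster_filters(cluster: str) -> List[str]:
    """One filter_expr per event source for the cluster-wide sweep."""
    return [
        f'kubernetes.cluster.name = "{cluster}" and source = "{source}"'
        for source in ("syscall", "k8s_audit", "cloudtrail")
    ]


def account_filter(account: str) -> str:
    """filter_expr for the cloud-account-wide sweep (AWS form from the text)."""
    return (f'aws.accountId = "{account}" '
            'and source in ("cloudtrail", "agentless-aws-ml")')


def sweep_window(trigger: datetime, hours: int = 2) -> Tuple[datetime, datetime]:
    """Return the (start, end) of the window centred on the trigger time."""
    delta = timedelta(hours=hours)
    return trigger - delta, trigger + delta
```

For example, `cluster_filters("prod-eu-1")` yields the three expressions to issue in parallel.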

Patterns to recognise (not steps to execute)

When any of these rule names fire in the data you've pulled, that's the signal to expand. The action column is what direction to look, not a fixed query. See references/watchlist-patterns.md for the full hit-to-pivot table covering IMDS theft, persistent cloud creds, defense evasion, S3 exfil, privesc, lateral movement, runtime compromise, recon, and prompt-injection.

MITRE-tactic correlation

Read references/correlation-guide.md for the MITRE-tactic gate and the 1–5 scoring heuristics. Render only vulnerability/runtime pairs that score 4 or higher in the case body. Boosts (final score capped at 5): KEV listing +1, VT malicious_count ≥ 5 +1.
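The boost and render rules are simple arithmetic. A minimal sketch, assuming the base score has already been produced by the heuristics in the correlation guide (not reproduced here) and that each pair is a dictionary with a hypothetical `score` key:

```python
from typing import List


def boosted_score(base: int, in_kev: bool = False,
                  vt_malicious_count: int = 0) -> int:
    """Apply the KEV and VirusTotal boosts, capping the result at 5."""
    if not 1 <= base <= 5:
        raise ValueError("base score must be on the 1-5 scale")
    score = base
    if in_kev:
        score += 1  # CVE appears in CISA KEV
    if vt_malicious_count >= 5:
        score += 1  # at least 5 VirusTotal engines flag the binary
    return min(score, 5)


def pairs_to_render(pairs: List[dict], threshold: int = 4) -> List[dict]:
    """Keep only the pairs that belong in the case body."""
    return [p for p in pairs if p["score"] >= threshold]
```

A base score of 3 with a KEV listing becomes 4 and is rendered; a base of 4 with both boosts stays at the cap of 5.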

Stop conditions

You're done when:

  • The chain has a coherent narrative (initial access → execution → … → impact, or bounded by available data), AND
  • You've followed every watchlist hit at least one hop, AND
  • You've asked yourself "is there another threat group within the cluster + ±2h / account + ±2h window that belongs to this same incident?" — and either folded it in or recorded why not.

If you find yourself making more than ~4 follow-on calls per signal, stop and synthesise. Diminishing returns.

Phase 3 — Synthesise & report

Five steps, in this order:

  1. Ask the user for the report format via AskUserQuestion: "How do you want the report rendered? Markdown (default — plays nice with Jira/PagerDuty handoff) / HTML (renders Mermaid diagrams in browser, self-contained) / Both."

    If AskUserQuestion is not available, default to Markdown and mention HTML as a follow-up option in the summary.

  2. Write the file(s) to disk. Base path: /tmp/sysdig-runtime-investigate-<event_id_short>-<UTC-yyyymmdd-hhmm>.

    • Markdown → write <base>.md with Block 1 of references/reporting-templates.md verbatim. Always written when chosen — also the body of Jira/PagerDuty handoff downstream.
    • HTML → write <base>.html using Block 4 wrapper from the same file. The Block 1 markdown content is pasted into the wrapper's <script id="md"> block. Marked + mermaid CDN scripts render the page in the browser.
    • Both → write both files, same base name.
  3. Print a 2-paragraph summary to the user. Use Block 0 of reporting-templates.md — short, no tables. Cite the file path(s) so they can read it if they want. Then ask via AskUserQuestion: "Where do you want to report this case?" with options derived from what Phase 0 detected (Jira, PagerDuty, Both, Just show it).

Handoff:

  • Jira → Build the Block 2 payload from the templates (body = file content). If the project key is unknown, call mcp__atlassian__getVisibleJiraProjects and ask the user. Preview before submitting. Render a short block — Project · Issue type · Summary · Severity · Labels · Body length (chars) — then ask via AskUserQuestion: "Submit, edit, or cancel?". Only on Submit call mcp__atlassian__createJiraIssue. Capture key + URL and surface them with an undo line: "Undo: open the ticket and click Close."

  • PagerDuty → Build the Block 3 payload (custom_details.case = file content, truncate to ~30 KB if larger). Preview before submitting. Render Routing key (last 4) · Severity · Dedup key · Summary · Source · custom_details size (KB) and ask via AskUserQuestion: "Submit, edit, or cancel?". Only on Submit send via:

    printf '%s' "$PAYLOAD_JSON" | curl -sS -X POST \
      -H "Content-Type: application/json" \
      --data-binary @- \
      https://events.pagerduty.com/v2/enqueue
    

    Capture the dedup_key and incident URL from the response. Surface them with an undo line: "Undo: resolve the incident with event_action: resolve and the same dedup key."

  • Both → run the Jira preview-and-submit first, then the PagerDuty preview-and-submit. Confirm each separately; declining one does not cancel the other.

  • Just show → no extra action; the file path was already cited in the summary.

If the user picks Edit at the preview step, ask which fields to change, apply the edits to the payload, and re-show the preview. Loop until the user picks Submit or Cancel.

After handoff, surface the link/key (if any) to the user — one short line.
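For the PagerDuty path, the `$PAYLOAD_JSON` piped to curl could be assembled as below. This is a sketch, not Block 3 of the templates (which is not reproduced in this document): the top-level fields follow the public PagerDuty Events API v2 shape, while the dedup-key format and the `case_truncated` flag are assumptions.

```python
import json

MAX_CASE_BYTES = 30 * 1024  # "truncate to ~30 KB if larger"
SEVERITIES = {"critical", "error", "warning", "info"}  # Events API v2 values


def build_pagerduty_event(routing_key: str, event_id: str, summary: str,
                          source: str, severity: str, case_text: str) -> dict:
    """Build a trigger event whose custom_details carries the case file."""
    if severity not in SEVERITIES:
        raise ValueError(f"unsupported severity: {severity}")
    encoded = case_text.encode("utf-8")
    truncated = len(encoded) > MAX_CASE_BYTES
    if truncated:
        # Cut on bytes, then drop any partial multi-byte character at the end.
        case_text = encoded[:MAX_CASE_BYTES].decode("utf-8", errors="ignore")
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # Hypothetical dedup-key format; the event ID is passed through verbatim.
        "dedup_key": f"sysdig-runtime-investigate-{event_id}",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
            "custom_details": {
                "event_id": event_id,
                "case": case_text,
                "case_truncated": truncated,
            },
        },
    }


def to_payload_json(event: dict) -> str:
    """Serialise the event for the curl call shown above."""
    return json.dumps(event, ensure_ascii=False)
```

If `case_truncated` is true, that fact should also appear in the chat summary, in line with the partial-success rule in Error Handling.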

  4. Offer to act on the threat (optional handoff to remediation). After the reporting handoff is complete (or skipped), ask via AskUserQuestion: "Want me to act on this threat now? sysdig-runtime-remediate will propose response actions (kill, isolate, pause, etc.), explain the consequences on the live workload, and execute only what you approve." Options: Yes — invoke sysdig-runtime-remediate / No.

    • On Yes — announce the handoff explicitly per principle C-03 ("Handing off to sysdig-runtime-remediate with <event_id>. The case file you just generated will be its starting context."), then invoke sysdig-runtime-remediate <event_id>.
    • On No — proceed silently to step 5. The user can always come back later with /sysdig-runtime-remediate <event_id> — the case file and shared state will still be there.

    Skip this step if sysdig-runtime-remediate is not registered in the current plugin (back-compat for environments running an older Bloom).

  5. Persist shared state. Before exiting, call mcp__secure-mcp-server__save_skill_state with { "skill_state": "runtime-investigate", "version": <version-from-phase-0>, "data": { ... } }. Merge the freshly-read state from Phase 0 with the new entries: append this case to recent_cases (upsert by event_id, cap at 10), update last_run, and update preferred_jira_project / preferred_handoff if the user picked a destination. Skip silently if the Sysdig MCP isn't loaded.

Error Handling

Status vocabulary. Every section of the case object uses the house status set: done, pending, in_progress, failed, skipped. The reason goes in a separate reason field so the report and the summary can render either status alone or status + reason.

{ "vulnerabilities": { "status": "skipped", "reason": "no scan data" } }
{ "cloudtrail": { "status": "skipped", "reason": "no integration in tenant" } }
{ "vt": { "status": "skipped", "reason": "no API key" } }
{ "cti.nvd": { "status": "failed", "reason": "rate-limited" } }
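Records like the ones above can be produced by a small helper that rejects anything outside the status vocabulary. This is a sketch; the helper name and the use of keyword arguments for extra fields such as `truncated` are assumptions.

```python
from typing import Any, Optional

# The house status set from the paragraph above.
ALLOWED_STATUSES = ("done", "pending", "in_progress", "failed", "skipped")


def section_status(status: str, reason: Optional[str] = None,
                   **extra: Any) -> dict:
    """Build a status record, keeping the reason in its own field."""
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"unknown status: {status}")
    record = {"status": status}
    if reason is not None:
        record["reason"] = reason
    record.update(extra)  # e.g. truncated=True, shown=100, total=250
    return record


# Example: mirrors the first record shown above.
case = {"vulnerabilities": section_status("skipped", "no scan data")}
```

Because the reason lives in a separate field, the report can render the status alone or the status with its reason.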

User-facing error template. Every error message the user sees follows three lines: what failed, why, and the fix (a copy-pasteable command or concrete next step). Keep the whole thing under four lines. Examples:

Can't reach Sysdig — the MCP server isn't registered or authenticated. Register it with claude mcp add --transport http secure-mcp-server https://<region>/mcp/secure, then run /mcp → Sysdig → Authenticate. Per-region URLs are in references/mcp-setup.md. Then re-run.

No threats matched the last 24h. Either the tenant is quiet or the time window is too tight. Widen the time window (e.g. 7 days) or pick a specific event ID.

Couldn't open the Jira ticket — Atlassian MCP returned 403. Your token doesn't have write:jira-work for project RUNTIME. Refresh the token in the Atlassian admin console and retry, or pick another destination.

Truncation, skipped enrichments, and other partial-success cases must be flagged in the 2-paragraph chat summary too — never let the user discover them only by reading the case file.

| Situation | Behavior |
| --- | --- |
| Sysdig MCP preflight returns State 2 or State 3 | Abort in Phase 0a per auth-preflight.md — emit its verbatim message and stop. No data calls, no state read, no file writes. |
| Trigger ID resolves to neither a threat nor a group (get_threats_engine_threat, get_threats_engine_group, and list_threats_engine_threats_by_group all 404) | Fall back to mcp__secure-mcp-server__get_event_info with the same ID. If that 404s too, tell the user the ID wasn't found and offer interactive mode. |
| No qualifying threats / events found in Phase 1 | Tell the user, offer to widen the time window. |
| Vulnerability lookup empty | When mcp__secure-mcp-server__list_vulnerability_findings_by_image returns no findings, try mcp__secure-mcp-server__list_runtime_scan_results (filter by workload/host) then mcp__secure-mcp-server__get_scan_result. If still nothing, record { "vulnerabilities": { "status": "skipped", "reason": "no scan data" } } and continue. |
| WebFetch to a public CTI source fails or rate-limits | Mark that source { "status": "failed", "reason": "<rate-limited \| network \| 4xx>" } and continue. Don't retry in a loop. |
| VirusTotal API key not detected | Record { "vt": { "status": "skipped", "reason": "no API key" } }. |
| No Jira and no PagerDuty detected | Ask whether the user wants to configure one or just show the file path. |
| Atlassian MCP / PagerDuty curl errors during handoff | Record { "handoff": { "status": "failed", "reason": "<error>" } }, surface the error, print the file path, offer to retry or switch destination. |
| mcp__secure-mcp-server__list_runtime_events cursor pagination drops filter_expr | Known MCP quirk — paginating with cursor returns events from the wrong scope. Don't paginate; widen scope_hours and re-issue with the original filter. |
| A list_runtime_events call returns >100 events | Tighten the time window or split the filter. Never silently drop the section — record { "status": "done", "truncated": true, "shown": <n>, "total": <m> }. |
| CloudTrail integration not present in tenant | Cloud-account sweep returns empty. Record { "cloudtrail": { "status": "skipped", "reason": "no integration in tenant" } } on the case; flag in the summary that S3 / IAM / IMDS cloud-API signals were unavailable. |
| SysQL schema rejects a query | Try the alternate relation shapes (HAS vs BINDS_TO, etc.). If all fail, record { "status": "failed", "reason": "schema mismatch" } and continue. The skill should not block on schema mismatches. |

Install via CLI

    npx skills add https://github.com/sysdig/skills --skill sysdig-runtime-investigate