sysdig-investigate

name: sysdig-investigate
description: >
  Investigate vulnerable images in a Sysdig-monitored environment. Fetches and
  ranks images by a chosen risk metric (finding_count, exposure_time_weighted,
  exposure_time_avg, sla_compliance, or actually_exploitable_findings), builds a
  remediation plan, optionally creates a tracking ticket (Jira / Linear / GitHub
  Projects) using Sysdig-side signals to determine the assignee, and hands off to
  /sysdig-remediate. Triggers on: "investigate", "what should I fix", "show me
  vulnerable images", "prioritize vulnerabilities", "/sysdig-investigate". Not for
  opening PRs, applying code fixes, or generating Dockerfile patches; use
  /sysdig-remediate for that.
allowed-tools:
  - AskUserQuestion
  - Read
  - Write
  - mcp__secure-mcp-server__get_customer_settings
  - mcp__secure-mcp-server__get_skill_state
  - mcp__secure-mcp-server__save_skill_state
  - mcp__secure-mcp-server__delete_skill_state
  - mcp__secure-mcp-server__list_plans
  - mcp__secure-mcp-server__list_plan_remediation_jobs
  - mcp__secure-mcp-server__list_zones
  - mcp__secure-mcp-server__list_vulnerability_findings_by_image
  - mcp__secure-mcp-server__list_candidate_remediation_jobs
  - mcp__secure-mcp-server__run_sysql
  - mcp__secure-mcp-server__create_plan
  - mcp__atlassian__searchJiraIssuesUsingJql
  - mcp__atlassian__createJiraIssue
  - mcp__atlassian__editJiraIssue
  - mcp__atlassian__addCommentToJiraIssue

First-run notice (Public Beta)

Before doing any other work for this skill, perform this one-time check:

If ~/.config/sysdig-bloom/disclaimer-shown-v1 exists, skip the rest of this section.
Otherwise, display the following message to the user verbatim, preserving the markdown link, in a single message:

This plugin is a Public Beta release. It is provided “as is” and “as available,” without warranties of any kind. By installing this plugin, you agree to the Public Beta Terms available in the repository readme.
Create the marker file ~/.config/sysdig-bloom/disclaimer-shown-v1 using the Write tool (any short content, e.g. the current UTC timestamp). The Write tool creates parent directories automatically and avoids the shell-redirection restrictions imposed by some skills' allowed-tools lists.
Then continue with the user's request.
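
The one-time check above is a marker-file pattern. The following Python sketch only illustrates the logic; the skill itself is expected to use the Write tool rather than run code, and the function names are hypothetical. The demo uses a temporary directory in place of ~/.config/sysdig-bloom.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def disclaimer_needed(marker: Path) -> bool:
    """The notice is shown only while the marker file is absent."""
    return not marker.exists()


def record_disclaimer_shown(marker: Path) -> None:
    """Create the marker (and its parent directories) with a UTC timestamp."""
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.write_text(datetime.now(timezone.utc).isoformat())


# Demo against a throwaway directory instead of the real config path.
with tempfile.TemporaryDirectory() as tmp:
    marker = Path(tmp) / "sysdig-bloom" / "disclaimer-shown-v1"
    first_run = disclaimer_needed(marker)    # marker absent: show the notice
    record_disclaimer_shown(marker)
    second_run = disclaimer_needed(marker)   # marker present: skip the notice
```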

When you need to ask the user a question, get confirmation, or present choices, use the AskUserQuestion tool if available. This ensures proper rendering across all agent clients.

Investigate vulnerable images in a Sysdig-monitored environment in four phases: discover the candidates (existing plan when sage.next is enabled, or zone-based search on the legacy path), prioritize by a focus metric, optionally ticket them in your tracker, and hand off to /sysdig-remediate for the fix. This skill never opens PRs or applies fixes — that work lives in /sysdig-remediate.

To apply the fix, run /sysdig-remediate after this skill hands off. /sysdig-remediate resolves safe fix versions, opens a PR/MR, and updates the linked ticket on completion.

Conversation rules

Narrate before every tool call. Before invoking any tool, say what you're about to do and which tool you're using. No silent calls.
Announce every skill handoff. Before invoking another skill, name it explicitly and summarize what it'll do, then wait for confirmation.

State

Read state via get_skill_state, write via save_skill_state. Schema and rules: see references/state.md. Treat null as { "version": 0 }.

Steps

0. Trust preamble

Always present this before asking any questions. See references/trust-preamble.md for the full text. After presenting the preamble, proceed directly to step 0a — do not ask for confirmation.

0a. Prerequisites and routing

MCP authentication preflight. Before any other step, run the preflight in references/auth-preflight.md and follow its instructions exactly. If it tells you to abort, abort — do not call any MCP tools or perform other side effects.

If a tool call later fails during normal operation, use the diagnostic checklist in references/mcp-setup.md to identify the specific failure.

Do not proceed until the MCP server is reachable.

Route on sage.next.enabled. From the same get_customer_settings response, read the sage.next.enabled flag:

If true → continue to step 1 (plan-based hot path).
If false → jump to step 1L (legacy fallback).
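
As a rough illustration of this routing, assuming the settings response can be read as a mapping with a flat "sage.next.enabled" key (the real response shape is not specified here) and that a missing flag is treated as false, a minimal Python sketch could look like this. The function name is hypothetical.

```python
from typing import Any, Mapping


def next_step(settings: Mapping[str, Any]) -> str:
    """Return "1" for the plan-based path or "1L" for the legacy fallback."""
    # Assumption: only an explicit True enables the plan-based path.
    if settings.get("sage.next.enabled") is True:
        return "1"
    return "1L"
```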

1. Plan-based flow — pick an existing plan or start fresh

a. List existing plans. Using the tool list_plans, check if there are existing plans the company is working on. A plan is a tracked set of remediation jobs (one job per image to fix) within a chosen scope — one or more zones (groups of resources sharing a policy, e.g. production or staging) — and a chosen target_measure. Present the plans to the user and ask whether they'd like to work on one of them, or "none of these — start a free investigation."

b. Branch on the user's choice.

If the user picks an existing plan → fetch and present its images using list_plan_remediation_jobs with the selected plan_id, then jump to step 4a.
If the user picks "none" → proceed to step 2 (zones) for a free investigation.

1L. Legacy fallback (only when `sage.next.enabled` is false)

Refer to zones to fetch the zones available to the user. Ask the user to pick a zone. Always include an "Entire Infrastructure" option (no zone).

Then, using the list_vulnerability_findings_by_image tool and the user-selected zone IDs, fetch the 10 most vulnerable images. Pass the zone IDs in the zoneId_in tool parameter.

After listing, tell the user how much they're seeing. If the API response includes a total count, print "Showing top 10 of N — say 'more' to expand or describe a filter."; otherwise print "Showing the top 10. Say 'more' to fetch additional results or 'filter' to narrow." When the user says "more", re-call list_vulnerability_findings_by_image with a higher limit.

Ask: "Which of these would you like to remediate? (say 'all', pick numbers, or describe a filter)" Skip to step 5.

2. Discover zones

Refer to zones to fetch the zones available to the user. Ask the user to pick a zone. Always include an "Entire Infrastructure" option (no zone). Use the zone ID, not the zone name, when passing the choice to other tools.

3. Choose focus

Ask the user what they want to focus on:

"What would you like to investigate?"

Most CVEs (finding_count) — total distinct CVE+package findings.

Longest exposure (exposure_time_weighted) — older findings weigh more.

Average age (exposure_time_avg) — average age of Critical+High findings.

SLA risk (sla_compliance) — ranks by oldest-bucket age vs. SLA threshold (as computed by Sysdig).

Actually exploitable (actually_exploitable_findings) — in-use AND network-reachable.

Map the user's choice to the target_measure parameter (the parenthetical above).
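
The mapping from the user's answer to target_measure can be sketched as a small lookup. This Python example is illustrative only: the labels and measure names come from the list above, while the function name and the lenient matching (label or raw measure name, case-insensitive) are assumptions.

```python
FOCUS_TO_TARGET_MEASURE = {
    "Most CVEs": "finding_count",
    "Longest exposure": "exposure_time_weighted",
    "Average age": "exposure_time_avg",
    "SLA risk": "sla_compliance",
    "Actually exploitable": "actually_exploitable_findings",
}


def target_measure_for(choice: str) -> str:
    """Accept either the display label or the measure name itself."""
    normalized = choice.strip().lower()
    for label, measure in FOCUS_TO_TARGET_MEASURE.items():
        if normalized in (label.lower(), measure):
            return measure
    raise ValueError(f"Unknown focus: {choice!r}")
```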

4. Fetch and present images

Call list_candidate_remediation_jobs with:

target_measure: the selected target_measure
scope: a JSON object such as { "zones": [] }, with the selected zone IDs in the array
limit: 10

If the result is empty, tell the user there are no matching vulnerabilities in that environment and stop.

4a. Present images

Present the results as a ranked table, sorted by internet-exposed first, then by severity score descending:

| # | Image | Ranking Summary | Finding Percentage | Finding Count | Resource Count |
|---|-------|-----------------|--------------------|---------------|----------------|
| 1 | quay.io/org/app:1.2 | 9 actually exploitable findings | 30% | 17 | 23 |
| 2 | quay.io/org/svc:2.0 | 16 actually exploitable findings | 12% | 43 | 44 |
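
The ordering rule for this table (internet-exposed first, then severity score descending) amounts to a two-part sort key. A minimal Python sketch follows; the ImageRow type and its field names are hypothetical and do not reflect the actual tool response.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ImageRow:
    image: str
    internet_exposed: bool   # hypothetical field name
    severity_score: float    # hypothetical field name


def rank(rows: List[ImageRow]) -> List[ImageRow]:
    """Exposed images sort first (False < True), then higher scores first."""
    return sorted(rows, key=lambda r: (not r.internet_exposed, -r.severity_score))
```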

After the table, tell the user how much they're seeing. If the API response includes a total count, print "Showing top 10 of N — say 'more' to expand or describe a filter."; otherwise print "Showing the top 10. Say 'more' to fetch additional results or 'filter' to narrow." When the user says "more", re-call list_candidate_remediation_jobs with a higher limit.

Ask: "Which of these would you like to remediate? (say 'all', pick numbers, or describe a filter)"

5. Build a remediation plan

For each selected image, fetch its details using one of the following SysQL query templates.

If the selected metric is actually_exploitable_findings, run this query:

MATCH KubeWorkload HAS Container RUNS Image AFFECTED_BY Vulnerability
  WHERE Image.imageId = '<image_id>' 
  AND Vulnerability.severity IN ['Critical', 'High'] 
  AND Vulnerability.inUse = true 
  AND Vulnerability.hasFix = true 
  AND KubeWorkload.isExposed = true
  RETURN DISTINCT Image, count(DISTINCT KubeWorkload), count(DISTINCT Vulnerability);

For all other metrics:

MATCH KubeWorkload HAS Container RUNS Image AFFECTED_BY Vulnerability
  WHERE Image.imageId = '<image_id>'
  AND Vulnerability.severity IN ['Critical', 'High']
  RETURN DISTINCT Image.imageReference,
  count(DISTINCT KubeWorkload) AS workloads_count,
  KubeWorkload.isExposed,
  count(DISTINCT Vulnerability);

When more than one image is selected, present a plan table for explicit user approval before proceeding:

## Remediation Plan — <environment>

| # | Image | Workloads Count | Workload Exposed | Fixables | Critical (total) | High (total) |
|---|-------|-------|-----------|---------|-----|-----|
| 1 | quay.io/org/app:1.2 | 12 | Yes | 23 | 3 | 20 |
| 2 | quay.io/org/svc:2.0 | 3  | No  | 12 | 1 | 11 |

Total: 2 images selected.

If the user is doing a free investigation, ask them to create a plan with the selected filters (zone + metric). Use the create_plan tool and ask the user for all required parameters. Once done, ask: "Which one do you want to fix?"

Wait for explicit approval. The user can say "skip #2", "only exposed ones", etc.

5b. Optional ticketing

Refer to ticketing for supported systems and required configuration.

Detect available ticketing systems

Check whether any of these are reachable via MCP or CLI:

Jira — jira-mcp-server MCP tools (jira_search_issues, jira_create_issue, jira_update_issue) or jira CLI.
Linear — linear MCP tools or linear CLI.
GitHub Projects — github MCP tools (add_project_item) or gh project CLI.

Record the detected system as ticketing_system in state. If none are found, set ticketing_system: null.

Ask whether to create tickets

Ask the user:

"Do you want to create tracking tickets for any of these images? (yes / no / pick which)"

Ticketing is fully optional. If the user says no — or no ticketing system was detected and the user does not want to configure one — skip the rest of this step entirely and go to step 6 with no ticket_key.

If a ticketing system is available but the user wants to use a different one, ask them to configure it; if they decline, proceed without ticketing.

If the user wants to use a system that is detected but missing credentials (token, project, user, etc.), ask for the missing configuration before proceeding.

For each image where the user wants a ticket

a. Search for existing tickets

Before creating a new ticket, search the configured system for existing tickets that reference the same image:

Search by image name in ticket summaries (e.g. summary ~ "<image-name>").
Also search the image reference in ticket descriptions.

If existing tickets are found:

If any are still open, propose updating the existing ticket instead of creating a duplicate. Show the ticket summary and ask the user to confirm.
Extract the assignee from the most recent ticket for this image and record it as previous_ticket_assignee for use in the assignee priority chain below.

b. Determine assignee (Sysdig-side signals only)

Use the first signal that yields a result. Do not use git log / file authors here — those are PR reviewers and live in /sysdig-remediate.

workload_owner — owner annotation/label on the running workload. Query via SysQL, e.g.:

MATCH KubeWorkload HAS Container RUNS Image
  WHERE Image.imageReference CONTAINS '<image_name>'
  RETURN DISTINCT KubeWorkload.labels, KubeWorkload.annotations;

Inspect labels/annotations like owner, team, app.kubernetes.io/owner.

zone_owner — if the selected zone defines an owner, use it.
previous_ticket_assignee — from ticket_assignees state for this image, or from a prior open ticket discovered in step 5b.a.
Leave unassigned.

Present the proposed assignee with its source, e.g.:

"Suggesting @platform-team as assignee — workload has label team: platform-team."
"Suggesting @jane.doe — they were assigned to the previous ticket (PROJ-100) for this image."

Always confirm with the user before setting the assignee. Record the choice and its source in ticket_assignees.
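
The priority chain above is a first-match lookup. As an illustrative Python sketch (the signals dictionary and the function name are hypothetical; how each signal is actually obtained is described in the steps above):

```python
from typing import Mapping, Optional, Tuple

# Order matters: the first signal that yields a value wins.
SIGNAL_ORDER = ("workload_owner", "zone_owner", "previous_ticket_assignee")


def propose_assignee(signals: Mapping[str, Optional[str]]) -> Tuple[Optional[str], Optional[str]]:
    """Return (assignee, source), or (None, None) to leave the ticket unassigned."""
    for source in SIGNAL_ORDER:
        value = signals.get(source)
        if value:
            return value, source
    return None, None
```

The returned source is what gets shown to the user alongside the suggestion, and the result is still only a proposal until the user confirms it.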

c. Create or update the ticket

Show the draft to the user before any write operation.

Summary: [Sysdig] Fix Critical/High vulnerabilities in <image_reference>

Description (Jira wiki markup; adapt the syntax to Markdown for Linear / GitHub Projects):

h2. Vulnerability Report

This ticket was created by the Sysdig investigate skill after scanning
the *<environment>* environment on <date>.

h2. Vulnerable Image

||Property||Value||
|Image|{noformat}<image_reference>{noformat}|
|Base OS|<base_os>|
|Environments|<comma-separated list>|
|Affected workloads|<workloads_count> (<workloads_internet_exposed_count> internet-exposed)|

h2. Critical & High CVEs

||CVE||Severity||Package||Installed||Fix Version||CVSS||Exploitable||
|<cve_id>|<severity>|<package>|<installed_version>|<fix_version or "none available">|<cvss>|<yes/no>|
(repeat for each CVE)

h2. Impact Assessment

<actually_exploitable_explanation>

*Network exposure:* <network_mitigated_explanation>
*Acceptable risk:* <has_acceptable_findings_explanation>

h2. Recommended Actions

For each CVE, describe what needs to happen:
- *<CVE-ID>* (<severity>): Update <package> from <installed_version> to a safe target version.
- *<CVE-ID>* (<severity>): No fix available — monitor for upstream patch.

h2. Next Step

Run `/sysdig-remediate <image_reference> (image_id: <image_id>, ticket: <THIS_TICKET_KEY>)`
to attempt a code fix; this ticket will be updated automatically with the PR link
on completion.

h2. References

- Sysdig investigation global_id (unique candidate-job identifier in Sysdig): {noformat}<global_id>{noformat}
- Detection date: <date>

Priority:

severity_normalized >= 0.9 → Critical
severity_normalized >= 0.7 → High
otherwise → Medium
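
The thresholds above translate directly into code. A minimal Python sketch, assuming severity_normalized is a float between 0 and 1 (the function name is hypothetical):

```python
def ticket_priority(severity_normalized: float) -> str:
    """Map the normalized severity to a ticket priority; bounds are inclusive."""
    if severity_normalized >= 0.9:
        return "Critical"
    if severity_normalized >= 0.7:
        return "High"
    return "Medium"
```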

Updating an existing ticket (when step 5b.a found one): never remove or modify the existing description. Append below a separator and a new section:

----

h2. Update — <date>

_Added by Sysdig investigate skill._

h3. New/Updated CVEs
... (only include changes since the last update)

h3. Recommended Actions
- <updated action items>
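
The append-only rule for updates can be sketched as follows. This Python example is an illustration, not the skill's implementation: the function name is hypothetical, and it assumes the update section has already been rendered as text.

```python
SEPARATOR = "----"


def append_update(existing_description: str, update_section: str) -> str:
    """Return the original description, byte for byte, followed by the update."""
    if not update_section.strip():
        # Nothing to add: leave the description untouched.
        return existing_description
    return existing_description + "\n\n" + SEPARATOR + "\n\n" + update_section.strip() + "\n"
```

Because the original text is only ever used as a prefix, whatever a human wrote in the ticket survives every update.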

After the create or update operation, record the result in state:

Append/upsert into tickets (matched by ticket_key).
Append into ticket_history (always append).
Attach ticket_key to that image's plan entry so step 6 can pass it to /sysdig-remediate.

6. Hand off to `/sysdig-remediate`

Before the first invocation, announce the handoff and wait for confirmation:

"Investigation complete. Handing off to /sysdig-remediate for <image_reference> — it will analyze fix versions and open a PR. Continue?"

For each approved image (in order), invoke /sysdig-remediate passing the image reference, image_id, and the optional ticket_key if a ticket was created or matched in step 5b:

/sysdig-remediate <image_reference> (image_id: <image_id>)
/sysdig-remediate <image_reference> (image_id: <image_id>, ticket: <ticket_key>)

The ticket argument is optional. When present, /sysdig-remediate will update that ticket with the PR link on completion. When absent, /sysdig-remediate opens the PR without touching any ticketing system.
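
The two invocation forms differ only in the optional ticket argument. A small Python sketch of how the command string could be assembled (the function name and the sample image_id value in the usage note are hypothetical):

```python
from typing import Optional


def remediate_command(image_reference: str, image_id: str, ticket_key: Optional[str] = None) -> str:
    """Build the handoff command, adding the ticket argument only when present."""
    args = f"image_id: {image_id}"
    if ticket_key:
        args += f", ticket: {ticket_key}"
    return f"/sysdig-remediate {image_reference} ({args})"
```

For example, remediate_command("quay.io/org/app:1.2", "sha256:abc", "PROJ-100") yields the second form shown above, and omitting the ticket key yields the first.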

Branching paths

After step 5b, the user is on one of these paths:

investigate → ticket → stop — ticket created, someone else will pick it up later and run /sysdig-remediate <image> (..., ticket: <key>).
investigate → ticket → remediate — ticket created, immediately hand off; remediate updates the ticket on PR open.
investigate → remediate (no ticket) — user skipped ticketing entirely; remediate opens the PR with no ticket update.
investigate (no ticket) → stop — user reviewed the plan and chose not to act now.

After each image completes, update its plan entry status to remediated (or skipped if the user chose to skip it), then ask whether to continue with the next image.

7. Save state

Call the MCP tool save_skill_state with { "skill_state": "investigate", "version": <n>, "data": { ... } }. Refer to the State section above for the full schema. Persist in data:

last_run, environment, focus, images_found, images_planned, images_remediated, plan
ticketing_system (or null if the user skipped ticketing)
ticket_assignees, tickets, ticket_history (only if any ticket activity happened)

Version on write: pass the same version value returned by the get_skill_state call at the start of the session — or 0 if the call returned null (no prior state). The server bumps the version itself. See Read/write rules. Do not include version inside data.

Save state even if the session ended before remediation started — the plan entries with status: "pending" allow the user to resume from where they left off.

On a 409 conflict, call get_skill_state again, merge the plan entries (upsert by global_id) into the freshly-read state, and retry once with the new version.
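
The merge-and-retry rule can be sketched in Python as below. Everything here is an assumption made for illustration: get_state and save_state are hypothetical callables standing in for get_skill_state and save_skill_state, Conflict stands in for the 409 response, and the state is assumed to be a dictionary with "version" and "data" keys.

```python
from typing import Any, Callable, Dict, List, Optional

Entry = Dict[str, Any]


class Conflict(Exception):
    """Stand-in for a 409 response from the save call."""


def merge_plan(fresh: List[Entry], local: List[Entry]) -> List[Entry]:
    """Upsert local plan entries into the fresh plan, keyed by global_id."""
    merged = {e["global_id"]: dict(e) for e in fresh}
    for e in local:
        merged[e["global_id"]] = {**merged.get(e["global_id"], {}), **e}
    return list(merged.values())


def save_with_retry(
    get_state: Callable[[], Optional[Dict[str, Any]]],
    save_state: Callable[[int, Dict[str, Any]], Any],
    local_data: Dict[str, Any],
    version: int,
) -> Any:
    try:
        return save_state(version, local_data)
    except Conflict:
        fresh = get_state() or {"version": 0}   # treat null as version 0
        fresh_data = fresh.get("data", {})
        plan = merge_plan(fresh_data.get("plan", []), local_data.get("plan", []))
        # Retry exactly once; a second Conflict propagates to the caller.
        return save_state(fresh["version"], {**fresh_data, "plan": plan})
```

Letting the second conflict propagate keeps the retry bounded, so the caller can surface it to the user instead of looping.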

Error handling

Every failure surfaced to the user follows the same what / why / fix shape, modeled on the verbatim MCP-unreachable message in step 0a:

What — the specific operation that failed.
Why — the underlying cause in concrete terms.
Fix — the exact next action or command, copy-pasteable.

Apply this template to:

SysQL failures in step 5 (e.g. malformed query, MCP timeout, empty result against an expected non-empty workload).
list_candidate_remediation_jobs errors in step 4 beyond the "empty result" case already documented (e.g. 4xx/5xx, scope mismatch, missing zone access).
save_skill_state conflicts in step 7 — after one merge-and-retry attempt fails, surface the conflict with what/why/fix instead of silently retrying.
Ticketing write failures in step 5b — per-image, with the ticket system, the operation (create vs. update), and the system's error reason.

For batched ticket operations (when the user said "create tickets for all"), use a partial-success report instead of bundling everything into a single success or single failure:

"Created 3 of 5 tickets. The other 2 failed: quay.io/org/app:1.2 (Jira: assignee not in project), quay.io/org/svc:2.0 (Jira: 401 unauthorized). Want details on any?"

Never report partial success as plain "success".
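
A partial-success report like the one quoted above can be assembled from two collections: the images whose tickets were created and the failures with their reasons. A minimal Python sketch (the function name and argument shapes are hypothetical):

```python
from typing import Dict, List


def batch_report(created: List[str], failed: Dict[str, str]) -> str:
    """Summarize a batch; failures are listed per image with the system's reason."""
    total = len(created) + len(failed)
    if not failed:
        return f"Created {total} of {total} tickets."
    details = ", ".join(f"{image} ({reason})" for image, reason in failed.items())
    return (
        f"Created {len(created)} of {total} tickets. "
        f"The other {len(failed)} failed: {details}. Want details on any?"
    )
```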

Important rules

Always read state at the start (via get_skill_state) and write state at the end (via save_skill_state), even for short sessions.
Keep the conversation focused: one environment per session.
Do not perform fix analysis or open PRs here — that is the job of /sysdig-remediate.
Always sort the image table: internet-exposed first, then by severity score descending.
When the user says "all", still present the plan table and ask for explicit confirmation before handing off to /sysdig-remediate.
Never invoke /sysdig-remediate on an image without the user's explicit approval.
Ticketing is always optional — proceed without a ticket whenever the user declines or no ticketing system is configured.
For ticket assignees, use Sysdig-side signals only (workload_owner, zone_owner, previous_ticket_assignee). Never use git log / file authors here — those belong to PR review in /sysdig-remediate.
Always search for an existing open ticket before creating a new one. Prefer updating over duplicating.
When updating an existing ticket, never remove or modify the original description — append below a ---- separator.
Never set an assignee without confirming with the user first.
Respect the user's "no". When the user declines an image, ticket suggestion, or proposed assignee, mark the decision (e.g. set the matching plan entry's status to skipped) and never re-suggest the same item within the session.