lvms-ci-doctor - SKILL.md Agent Skill

name: lvms-ci:doctor
argument-hint: [release1,release2,...]
description: Analyze CI for LVMS periodic jobs and produce an HTML summary
user-invocable: true
allowed-tools: Skill, Bash, Read, Write, Glob, Grep, Agent

lvms-ci:doctor

Synopsis

/lvms-ci:doctor main
/lvms-ci:doctor 4.20,4.21,4.22,main

Description

Accepts a comma-separated list of release versions (or main), runs analysis for each release, and produces a single HTML summary file consolidating all results. Uses deterministic scripts for data collection, artifact download, aggregation, and HTML generation. LLM agents are used only for per-job root cause analysis.

Arguments

<ARGUMENTS> (required): Comma-separated list of release versions (e.g., main or 4.20,4.21,4.22,main)

Work Directory

Compute the work directory path once at the start by running date +%y%m%d and substituting the result for <YYMMDD> in the path below. In all commands, replace <WORKDIR> with the computed literal path; do not use shell variables.

/tmp/lvm-operator-ci-claude-workdir.<YYMMDD>
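As a quick illustration of how the date suffix expands, here is a minimal Python sketch. The skill itself runs `date +%y%m%d` in the shell; the helper name `workdir_for` is hypothetical and exists only for this example.

```python
from datetime import date

# Hypothetical helper: mirrors `date +%y%m%d` (two-digit year, month, day).
def workdir_for(day):
    suffix = day.strftime("%y%m%d")
    return "/tmp/lvm-operator-ci-claude-workdir." + suffix

# A fixed date keeps the example reproducible.
print(workdir_for(date(2025, 3, 7)))  # /tmp/lvm-operator-ci-claude-workdir.250307
```

The resulting literal string is what replaces <WORKDIR> in every later command.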

Implementation Steps

Step 1: Prepare — Collect and Download All Artifacts

Goal: Deterministically collect all failed jobs and download their artifacts before any LLM analysis.

Actions:

Determine today's <WORKDIR> by running date +%y%m%d and substituting into /tmp/lvm-operator-ci-claude-workdir.<YYMMDD>. Use this value in all subsequent commands.

Run the prepare script:

bash plugins/lvms-ci/scripts/doctor.sh prepare --component lvm-operator --workdir <WORKDIR> <ARGUMENTS>

The script deterministically:
- For each release: fetches failed periodic jobs, downloads artifacts, writes <WORKDIR>/jobs/release-<version>-jobs.json
- Outputs a JSON summary listing all releases, job counts, and file paths
Read the JSON output to learn which releases have jobs to analyze and how many jobs each one has.

Job JSON field names (use these exactly — do NOT guess alternatives like job_name):

job — full job name
build_id — unique build identifier
artifacts_dir — local path to downloaded artifacts
url — Prow job URL
status — job result (failure, FAILURE, SUCCESS, PENDING)
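To make the field names concrete, the sketch below checks a single job record against the five keys listed above. Every value in the sample record is invented for illustration, and `missing_fields` is a hypothetical helper, not part of the skill's scripts.

```python
import json

# The five field names documented above.
REQUIRED_FIELDS = ("job", "build_id", "artifacts_dir", "url", "status")

# Illustrative record only: all values here are made up.
sample = json.loads("""
[
  {
    "job": "example-periodic-job",
    "build_id": "1234567890",
    "artifacts_dir": "/tmp/example/artifacts/1234567890",
    "url": "https://example.invalid/job/1234567890",
    "status": "FAILURE"
  }
]
""")

def missing_fields(record):
    """Return the documented field names that a record lacks."""
    return [name for name in REQUIRED_FIELDS if name not in record]

for record in sample:
    print(record["job"], record["build_id"], missing_fields(record))
```

A record keyed by a guessed name such as job_name would show up here with every documented field reported as missing.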

Error Handling:

If <ARGUMENTS> is empty, show usage and stop
If a release has no failed jobs, its jobs JSON will be an empty array — skip analysis for that release
If a release has an "error" field in the JSON summary, data collection failed for that release — report the error to the user but continue with other releases
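The per-release rules above can be sketched as a small planning function. The summary layout used here (a top-level `releases` list whose entries carry `release` and `job_count` keys) is an assumption made for illustration; only the `error` field name comes from this document, so check the real output of the prepare script before relying on these keys.

```python
# Hypothetical summary shape; only the "error" key is taken from the skill text.
def plan_analysis(summary):
    """Split releases into those to analyze and those whose collection failed."""
    to_analyze = []
    errors = {}
    for entry in summary["releases"]:
        name = entry["release"]
        if "error" in entry:
            # Collection failed: report it, but keep processing other releases.
            errors[name] = entry["error"]
        elif entry.get("job_count", 0) > 0:
            to_analyze.append(name)
        # Releases with zero failed jobs are skipped silently.
    return to_analyze, errors

example = {
    "releases": [
        {"release": "main", "job_count": 3},
        {"release": "4.22", "job_count": 0},
        {"release": "4.21", "error": "fetch failed"},
    ]
}
print(plan_analysis(example))
```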

Step 2: Analyze Each Job Using /lvms-ci:prow-job

Goal: Get detailed root cause analysis for each failed job using pre-downloaded artifacts.

Actions:

Use the JSON summary output from Step 1 to build agent prompts. Do NOT read the job JSON files into the main conversation — the prepare script already printed all job details (artifacts_dir, build_id, job name) and agents receive artifacts_dir directly in their prompt.

For every failed job across all releases, launch a separate Agent (using the Agent tool, NOT the Skill tool):

Agent: subagent_type=general_purpose, prompt="Analyze this Prow job and save the report:
1. Run /lvms-ci:prow-job <ARTIFACTS_DIR>
2. After the analysis completes, save the FULL report output (including the --- STRUCTURED SUMMARY --- block) to:
   <WORKDIR>/jobs/release-<RELEASE>-job-<N>-<JOB_ID>.txt
   Use the Write tool to save the file. The file must contain the complete analysis report."

Launch ALL agents in a single message as foreground agents (do NOT use run_in_background). Foreground agents in the same message run concurrently — this is just as fast as background agents but keeps your turn active until all complete.
Say "Analyzing N jobs in parallel..." in your message text alongside the Agent tool calls.
When all agents return, immediately proceed to Step 3 in the same turn. Do NOT stop or end your turn between Step 2 and Step 3.
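The prompt template and report filename can be filled in mechanically for each job. The sketch below assumes that <N> is a 1-based index within the release and that <JOB_ID> is the job's build_id; neither mapping is stated explicitly in this document, and both function names are hypothetical.

```python
# Assumption: <JOB_ID> maps to build_id and <N> is a 1-based per-release index.
def report_path(workdir, release, index, job_id):
    return "{}/jobs/release-{}-job-{}-{}.txt".format(workdir, release, index, job_id)

def agent_prompt(workdir, release, index, job):
    """Fill in the Step 2 prompt template for one job record."""
    path = report_path(workdir, release, index, job["build_id"])
    return (
        "Analyze this Prow job and save the report:\n"
        "1. Run /lvms-ci:prow-job " + job["artifacts_dir"] + "\n"
        "2. After the analysis completes, save the FULL report output "
        "(including the --- STRUCTURED SUMMARY --- block) to:\n"
        "   " + path + "\n"
        "   Use the Write tool to save the file. "
        "The file must contain the complete analysis report."
    )

# Illustrative values only.
job = {"build_id": "42", "artifacts_dir": "/tmp/w/artifacts/42"}
print(agent_prompt("/tmp/w", "main", 1, job))
```

Building every prompt up front makes it straightforward to issue all Agent calls in one message.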

Step 3: Finalize — Aggregate and Generate HTML Report

IMPORTANT: This step is MANDATORY. The task is incomplete without it. You MUST run this even if previous steps produced errors.

Goal: Deterministically aggregate results and generate the HTML report.

Actions:

Run the finalize script:

bash plugins/lvms-ci/scripts/doctor.sh finalize --component lvm-operator --workdir <WORKDIR> <ARGUMENTS>

The script deterministically:
- Runs aggregate.py for each release → summary.json files
- Runs create-report.py → report-lvm-operator-ci-doctor.html
Report the script's output to the user

Step 4: Report Completion

Actions:

Display the path to the generated HTML file
Summarize: failed job counts per release

Example Output:

Summary:
  Periodics:
    Release main: 3 failed periodic jobs
    Release 4.22: 0 failed periodic jobs

HTML report generated: <WORKDIR>/report-lvm-operator-ci-doctor.html
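The layout of the example output can be reproduced from per-release counts with a few lines of Python. This is only a formatting sketch: `format_summary` is a hypothetical helper, and the counts would come from the Step 1 summary rather than being hard-coded.

```python
def format_summary(counts, report):
    """Render failed-job counts per release in the layout shown above."""
    lines = ["Summary:", "  Periodics:"]
    for release, count in counts.items():
        lines.append("    Release {}: {} failed periodic jobs".format(release, count))
    lines.append("")
    lines.append("HTML report generated: " + report)
    return "\n".join(lines)

print(format_summary({"main": 3, "4.22": 0},
                     "<WORKDIR>/report-lvm-operator-ci-doctor.html"))
```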

Examples

Example 1: Analyze Main Branch Only

/lvms-ci:doctor main

Example 2: Analyze Multiple Releases

/lvms-ci:doctor 4.20,4.21,4.22,main

Prerequisites

gsutil CLI must be installed for GCS access (uses anonymous access on public buckets)
Internet access to fetch job data from Prow/GCS
Bash shell, Python 3

Related Skills

lvms-ci:prow-job: Single job analysis (used by Step 2 agents)

Notes

Deterministic scripts handle: data collection, artifact download, aggregation, HTML generation
LLM agents handle: per-job root cause analysis (Step 2)
All agents are launched in a single parallel wave
The prepare script downloads all artifacts upfront so prow-job agents use local paths (no redundant downloads)
The finalize script runs aggregation and HTML generation in one call
All intermediate files use prescribed filenames in <WORKDIR> — no improvised names
The HTML report is self-contained (no external CSS/JS dependencies)
If a release analysis fails, it is noted in the report but does not block other releases