nvflare-diagnose-job - SKILL.md Agent Skill

name: nvflare-diagnose-job
description: "Diagnose failed, stalled, or suspicious NVFLARE jobs in simulation, POC, or production by collecting bounded evidence and mapping failure patterns to recovery actions."
min_flare_version: "2.8.0"
blast_radius: read_only
skill_version: "0.1.0"

NVFLARE Diagnose Job

Use When

Use when the user asks why an NVFLARE job failed, stalled, timed out, ended with EXECUTION_EXCEPTION, lost clients, or produced suspicious logs, or when the user needs failure evidence interpreted.

Do Not Use When

Do not use for creating jobs, converting training code, submitting healthy jobs, monitoring a normal run, downloading results, production deployment, or generic Python debugging without NVFLARE job context.

Workflow

Determine runtime mode first:
- simulation: user provides job.py, SimEnv output, local logs, exported job folder, or a failed python job.py run;
- POC/production: user provides a job ID, startup kit, POC workspace, admin context, or asks about a running FLARE system.
If mode or evidence is ambiguous, ask for the missing mode, job ID, local log path, simulation output path, or startup-kit context before diagnosing.
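The mode decision above can be sketched as a small classifier over whatever the user supplied. This is an illustrative sketch only: the `Evidence` container, its field names, and the returned labels are hypothetical and are not part of NVFLARE or of this skill's packaged references.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    """Hypothetical container for what the user supplied."""
    job_py: Optional[str] = None           # path to job.py or a failed run of it
    local_log_path: Optional[str] = None   # local log file
    sim_output_path: Optional[str] = None  # SimEnv output or exported job folder
    job_id: Optional[str] = None           # job ID on a running FLARE system
    startup_kit: Optional[str] = None      # startup kit or admin context
    poc_workspace: Optional[str] = None    # POC workspace


def determine_mode(ev: Evidence) -> str:
    """Return 'simulation', 'poc_or_production', or 'ambiguous'."""
    sim = any([ev.job_py, ev.local_log_path, ev.sim_output_path])
    live = any([ev.job_id, ev.startup_kit, ev.poc_workspace])
    if sim and not live:
        return "simulation"
    if live and not sim:
        return "poc_or_production"
    # Either no signals or conflicting signals: ask the user before diagnosing.
    return "ambiguous"
```

An "ambiguous" result corresponds to the case where the missing mode, job ID, or path must be requested before any evidence command is chosen.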
For simulation mode, inspect local artifacts only. Use nvflare agent inspect <path> --format json when a project or job path is available, then read bounded local logs and generated job/config artifacts. For completed simulations, check the server workspace's simulate_job/metrics/ directory for metrics_summary.json and round_metrics.jsonl before falling back to logs for metric evidence.
For POC/production mode, collect bounded job and system evidence through the FLARE CLI, using --tail, --since, or --max-bytes for logs. For terminal jobs, use nvflare job download <job_id> -o <dir> --format json and read data.artifacts.global_model, data.artifacts.metrics_summary, and data.artifacts.round_metrics when present.
Match evidence against the packaged failure-pattern catalog before interpreting raw logs.
Report observed status, evidence quality, matched pattern, likely cause, confidence, recovery category, and concrete next action.
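For the simulation branch, the metrics check can be sketched as follows. The directory and file names come from the workflow above; the helper name `load_sim_metrics` and the shape of the returned dictionary are assumptions for illustration. The record fields inside the files are not specified here, so the sketch parses them generically as JSON and JSON Lines.

```python
import json
from pathlib import Path


def load_sim_metrics(server_workspace):
    """Read simulation metrics artifacts if present; never guess other paths."""
    metrics_dir = Path(server_workspace) / "simulate_job" / "metrics"
    result = {"summary": None, "rounds": [], "missing": []}

    summary_path = metrics_dir / "metrics_summary.json"
    if summary_path.is_file():
        result["summary"] = json.loads(summary_path.read_text(encoding="utf-8"))
    else:
        result["missing"].append(summary_path.name)

    rounds_path = metrics_dir / "round_metrics.jsonl"
    if rounds_path.is_file():
        with rounds_path.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:  # skip blank lines between records
                    result["rounds"].append(json.loads(line))
    else:
        result["missing"].append(rounds_path.name)

    return result
```

Anything listed under "missing" is a cue to fall back to bounded logs for metric evidence and to say so in the report.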

Requirements

Must keep diagnosis read-only.
Must distinguish simulation from POC/production before choosing evidence commands.
Must use the simulation server's metrics artifacts when present, and the artifacts returned by nvflare job download in production when available, instead of inventing metric or model paths.
Must keep log evidence bounded and report truncation or missing site logs.
Must avoid confident root-cause claims when required site evidence is missing.
Must not read private key contents, mutate jobs/configs/runtime state, or run unbounded scans.
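For local logs, the bounded-evidence requirement can be sketched as a read-only tail that reports whether it truncated. The helper name `read_log_tail` and the 64 KiB default are assumptions chosen for illustration, not values defined by this skill.

```python
import os


def read_log_tail(path, max_bytes=65536):
    """Read at most max_bytes from the end of a log file, read-only."""
    size = os.path.getsize(path)
    start = max(0, size - max_bytes)
    with open(path, "rb") as fh:  # binary read mode: the file is never modified
        fh.seek(start)
        data = fh.read(max_bytes)
    text = data.decode("utf-8", errors="replace")
    truncated = start > 0
    if truncated and "\n" in text:
        # The window probably starts mid-line; drop the partial first line.
        text = text.split("\n", 1)[1]
    return {
        "text": text,
        "truncated": truncated,
        "bytes_read": len(data),
        "file_size": size,
    }
```

The "truncated" flag and the two sizes are what the report would cite when it states that log evidence was cut off.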

Output Shape

Report:

runtime mode and evidence sources;
job status or local failure status;
matched failure pattern and confidence;
recovery category such as FIXABLE_BY_CODE, FIXABLE_BY_CONFIG, ENVIRONMENT_FAILURE, RETRYABLE, or UNKNOWN;
source-aware evidence summary with site/process labels when available;
next action and any missing evidence.
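One way to keep the report fields consistent is a small structure like the sketch below. The class names, field names, confidence labels, and the rule that caps confidence when evidence is missing are all hypothetical. The enum lists only the recovery categories named above, which the text introduces with "such as", so the real catalog may define more.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class RecoveryCategory(Enum):
    FIXABLE_BY_CODE = "FIXABLE_BY_CODE"
    FIXABLE_BY_CONFIG = "FIXABLE_BY_CONFIG"
    ENVIRONMENT_FAILURE = "ENVIRONMENT_FAILURE"
    RETRYABLE = "RETRYABLE"
    UNKNOWN = "UNKNOWN"


@dataclass
class DiagnosisReport:
    runtime_mode: str
    evidence_sources: List[str]
    status: str
    matched_pattern: Optional[str]
    confidence: str  # hypothetical labels: "high", "medium", "low"
    recovery_category: RecoveryCategory
    evidence_summary: List[str]  # entries may carry site/process labels
    next_action: str
    missing_evidence: List[str] = field(default_factory=list)

    def effective_confidence(self) -> str:
        # Assumed policy: never report "high" while required evidence is missing.
        if self.missing_evidence and self.confidence == "high":
            return "medium"
        return self.confidence

    def render(self) -> str:
        lines = [
            f"runtime mode: {self.runtime_mode}",
            f"evidence sources: {', '.join(self.evidence_sources)}",
            f"status: {self.status}",
            f"matched pattern: {self.matched_pattern or 'none'}",
            f"confidence: {self.effective_confidence()}",
            f"recovery category: {self.recovery_category.value}",
        ]
        lines += [f"evidence: {item}" for item in self.evidence_summary]
        lines.append(f"next action: {self.next_action}")
        if self.missing_evidence:
            lines.append(f"missing evidence: {', '.join(self.missing_evidence)}")
        return "\n".join(lines)
```

Capping confidence inside the structure rather than in the prose keeps the "no confident root-cause claims without site evidence" requirement hard to skip by accident.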

Load references/evidence-collection.md for mode-specific evidence collection and references/failure-patterns.md before assigning a likely failure cause.