name: nvflare-diagnose-job
description: "Diagnose failed, stalled, or suspicious NVFLARE jobs in simulation, POC, or production by collecting bounded evidence and mapping failure patterns to recovery actions."
min_flare_version: "2.8.0"
blast_radius: read_only
skill_version: "0.1.0"
NVFLARE Diagnose Job
Use When
Use when the user asks why an NVFLARE job failed, stalled, timed out, ended with
EXECUTION_EXCEPTION, lost clients, produced suspicious logs, or needs failure
evidence interpreted.
Do Not Use When
Do not use for creating jobs, converting training code, submitting healthy jobs, monitoring a normal run, downloading results, production deployment, or generic Python debugging without NVFLARE job context.
Workflow
- Determine runtime mode first:
  - simulation: user provides `job.py`, SimEnv output, local logs, exported job folder, or a failed `python job.py` run;
  - POC/production: user provides a job ID, startup kit, POC workspace, admin context, or asks about a running FLARE system.
- If mode or evidence is ambiguous, ask for the missing mode, job ID, local log path, simulation output path, or startup-kit context before diagnosing.
- For simulation mode, inspect local artifacts only. Use `nvflare agent inspect <path> --format json` when a project or job path is available, then read bounded local logs and generated job/config artifacts. For completed simulations, check the server workspace's `simulate_job/metrics/` directory for `metrics_summary.json` and `round_metrics.jsonl` before falling back to logs for metric evidence.
- For POC/production mode, collect bounded job and system evidence through the FLARE CLI, using `--tail`, `--since`, or `--max-bytes` for logs. For terminal jobs, use `nvflare job download <job_id> -o <dir> --format json` and read `data.artifacts.global_model`, `data.artifacts.metrics_summary`, and `data.artifacts.round_metrics` when present.
- Match evidence against the packaged failure-pattern catalog before interpreting raw logs.
- Report observed status, evidence quality, matched pattern, likely cause, confidence, recovery category, and concrete next action.
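The metric-evidence steps in the workflow above can be sketched in Python. This is a minimal illustration, not part of the skill or of NVFLARE: `read_simulation_metrics` and `extract_download_artifacts` are hypothetical helper names, `max_rounds` is an assumed bound, and the contents of `metrics_summary.json` and `round_metrics.jsonl` are treated as opaque JSON because their schema is not described here. The second helper assumes the JSON printed by `nvflare job download ... --format json` has already been parsed into a dict.

```python
import json
from pathlib import Path


def read_simulation_metrics(server_workspace, max_rounds=50):
    """Read metric artifacts from <server_workspace>/simulate_job/metrics/, if present.

    Hypothetical helper: opens files read-only and never guesses alternative paths.
    """
    metrics_dir = Path(server_workspace) / "simulate_job" / "metrics"
    evidence = {"summary": None, "rounds": [], "truncated": False, "missing": []}

    summary_path = metrics_dir / "metrics_summary.json"
    if summary_path.is_file():
        evidence["summary"] = json.loads(summary_path.read_text(encoding="utf-8"))
    else:
        evidence["missing"].append(summary_path.name)

    rounds_path = metrics_dir / "round_metrics.jsonl"
    if rounds_path.is_file():
        with rounds_path.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue  # skip blank lines between records
                if len(evidence["rounds"]) >= max_rounds:
                    evidence["truncated"] = True  # more records exist than were read
                    break
                evidence["rounds"].append(json.loads(line))
    else:
        evidence["missing"].append(rounds_path.name)
    return evidence


def extract_download_artifacts(payload):
    """Pick the three artifact entries named in the workflow out of parsed download output.

    Entries that are absent come back as None rather than as an invented path.
    """
    artifacts = (payload.get("data") or {}).get("artifacts") or {}
    keys = ("global_model", "metrics_summary", "round_metrics")
    return {key: artifacts.get(key) for key in keys}
```

Both helpers report what is absent (`missing`, or `None` values) so that the diagnosis can fall back to bounded logs and say so explicitly.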
Requirements
- Must keep diagnosis read-only.
- Must distinguish simulation from POC/production before choosing evidence commands.
- Must use simulation server metrics artifacts when present and production `nvflare job download` artifacts when available, instead of inventing metric or model paths.
- Must keep log evidence bounded and report truncation or missing site logs.
- Must avoid confident root-cause claims when required site evidence is missing.
- Must not read private key contents, mutate jobs/configs/runtime state, or run unbounded scans.
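One way to satisfy the bounded-log and missing-site requirements for local log files is sketched below. It is an illustration under stated assumptions: `read_log_tail` and `collect_site_logs` are hypothetical names, the 64 KiB default is an arbitrary bound, and the mapping of site labels to log paths is supplied by the caller rather than discovered by scanning.

```python
import os


def read_log_tail(path, max_bytes=65536):
    """Return at most max_bytes from the end of a log file, plus a truncation flag.

    The file is opened read-only, so this stays within a read-only diagnosis.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        if size > max_bytes:
            fh.seek(size - max_bytes)  # skip everything before the tail window
        data = fh.read(max_bytes)
    return {
        "path": str(path),
        "text": data.decode("utf-8", errors="replace"),
        "bytes_read": len(data),
        "total_bytes": size,
        "truncated": size > max_bytes,
    }


def collect_site_logs(log_paths, max_bytes=65536):
    """Read a bounded tail per site; log_paths maps a site label to a log path.

    Sites whose log file does not exist are returned separately so the report
    can list them as missing evidence.
    """
    evidence, missing = {}, []
    for site, path in log_paths.items():
        if os.path.isfile(path):
            evidence[site] = read_log_tail(path, max_bytes)
        else:
            missing.append(site)
    return evidence, sorted(missing)
```

Carrying `truncated` and the missing-site list alongside the text makes it straightforward to lower confidence when evidence is incomplete.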
Output Shape
Report:
- runtime mode and evidence sources;
- job status or local failure status;
- matched failure pattern and confidence;
- recovery category such as `FIXABLE_BY_CODE`, `FIXABLE_BY_CONFIG`, `ENVIRONMENT_FAILURE`, `RETRYABLE`, or `UNKNOWN`;
- source-aware evidence summary with site/process labels when available;
- next action and any missing evidence.
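The report fields listed above could be carried in a small structure like the following sketch. The class and field names are illustrative assumptions, the free-text `confidence` value is not a scale defined by this skill, and the enum holds only the categories named above, which are introduced with "such as" and may not be exhaustive.

```python
from dataclasses import dataclass, field
from enum import Enum


class RecoveryCategory(Enum):
    # Only the categories named in the Output Shape list; the real set may be larger.
    FIXABLE_BY_CODE = "FIXABLE_BY_CODE"
    FIXABLE_BY_CONFIG = "FIXABLE_BY_CONFIG"
    ENVIRONMENT_FAILURE = "ENVIRONMENT_FAILURE"
    RETRYABLE = "RETRYABLE"
    UNKNOWN = "UNKNOWN"


@dataclass
class DiagnosisReport:
    """Hypothetical container mirroring the Output Shape list, one field per item."""

    runtime_mode: str
    evidence_sources: list
    status: str
    matched_pattern: str
    confidence: str
    recovery_category: RecoveryCategory
    evidence_summary: dict  # site/process label -> short summary
    next_action: str
    missing_evidence: list = field(default_factory=list)

    def render(self):
        """Format the report as plain text, one line per field."""
        lines = [
            f"runtime mode: {self.runtime_mode}",
            f"evidence sources: {', '.join(self.evidence_sources) or 'none'}",
            f"status: {self.status}",
            f"matched pattern: {self.matched_pattern} (confidence: {self.confidence})",
            f"recovery category: {self.recovery_category.value}",
        ]
        for source, summary in self.evidence_summary.items():
            lines.append(f"evidence [{source}]: {summary}")
        lines.append(f"next action: {self.next_action}")
        lines.append(f"missing evidence: {', '.join(self.missing_evidence) or 'none'}")
        return "\n".join(lines)
```

Keeping `missing_evidence` as an explicit field means an incomplete diagnosis still renders a line stating what was not available.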
Load references/evidence-collection.md for mode-specific evidence collection
and references/failure-patterns.md before assigning a likely failure cause.