nvidia-ai-infrastructure-operations

Use this skill when reviewing NVIDIA AI infrastructure deployments — DGX, HGX, MGX systems, GPU server install posture, BMC and out-of-band exposure, BIOS/firmware levels, vGPU host configuration, and rack-scale power/cooling/networking readiness. Trigger when the user asks whether a GPU host is provisioned per NVIDIA reference architecture, whether the BMC is segmented, whether driver/firmware versions match the AI Enterprise support matrix, or whether the deployment is in scope for NCA-AIIO or NCP-AII certification expectations.

By Raishin. Updated 5/10/2026

name: nvidia-ai-infrastructure-operations
description: Use this skill when reviewing NVIDIA AI infrastructure deployments (DGX, HGX, and MGX systems), including GPU server install posture, BMC and out-of-band exposure, BIOS/firmware levels, vGPU host configuration, and rack-scale power/cooling/networking readiness. Trigger when the user asks whether a GPU host is provisioned per NVIDIA reference architecture, whether the BMC is segmented, whether driver/firmware versions match the AI Enterprise support matrix, or whether the deployment is in scope for NCA-AIIO or NCP-AII certification expectations.
allowed-tools: Read Grep Glob
metadata:
  author: "github: Raishin"
  version: "0.1.0"
  updated: "2026-05-10"
  category: platform

NVIDIA AI Infrastructure Operations Review

Purpose

Review NVIDIA GPU infrastructure deployments (DGX, HGX, MGX, certified OEM systems) against NVIDIA reference architectures and the NCA-AIIO / NCP-AII certification body of knowledge. Anchor judgments on driver + firmware + CUDA toolkit + AI Enterprise support matrix alignment, BMC/iDRAC/iLO segmentation, and host-level GPU configuration (persistence mode, ECC, MIG capability, vGPU).
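The BMC segmentation judgment can be sketched as a subnet-membership test on evidence already collected. This is a minimal sketch, not part of the skill itself: the function names `parse_bmc_ip` and `bmc_on_workload_network` are hypothetical, the input is assumed to be saved `ipmitool lan print` text, and the workload CIDR list is assumed to come from the site's own network plan. Sharing a subnet is only a proxy for reachability; routing and ACLs still need to be confirmed separately.

```python
import ipaddress


def parse_bmc_ip(lan_print_text):
    """Return the 'IP Address' value from saved `ipmitool lan print` text, or None."""
    for line in lan_print_text.splitlines():
        # Split on the first colon only, so values such as MAC addresses stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "IP Address":
            return value.strip()
    return None


def bmc_on_workload_network(bmc_ip, workload_cidrs):
    """Return the workload/tenant CIDRs that contain the BMC address."""
    addr = ipaddress.ip_address(bmc_ip)
    hits = []
    for cidr in workload_cidrs:
        net = ipaddress.ip_network(cidr, strict=False)
        # Compare only within the same IP version (IPv4 vs IPv6).
        if addr.version == net.version and addr in net:
            hits.append(cidr)
    return hits
```

A non-empty result would support raising the critical finding described under the operating rules; an empty result does not prove segmentation and should be reported as inference unless reachability was tested directly.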

Lean operating rules

  • Prefer live evidence (nvidia-smi, nvidia-smi -q, dmidecode, ipmitool lan print, dcgmi diag) when the active client exposes it; otherwise fall back to NVIDIA Enterprise Support documentation, sanitized topology diagrams, and the AI Enterprise compatibility matrix.
  • Separate confirmed facts from inference. If BMC network segmentation, firmware level, or driver-toolkit match was not directly queried, say so.
  • Treat a BMC / iDRAC / iLO interface reachable from a tenant or workload network as a critical finding. GPU hosts hold model weights and tenant data; OOB compromise is total compromise.
  • Treat driver / CUDA / cuDNN versions outside the published NVIDIA AI Enterprise support matrix as a high finding: they risk silent ABI breakage and leave workloads unsupported.
  • Treat ECC disabled on production GPUs as a high finding for training workloads (silent corruption of weights or gradients).
  • Treat persistence mode disabled on long-running inference hosts as a medium finding (driver re-init latency at first call).
  • Treat MIG-capable GPUs running in default whole-GPU mode in a multi-tenant cluster as a medium finding — partitioning is the isolation primitive.
  • Treat an absent or unverified firmware bundle (HGX baseboard, NVSwitch, BMC) as a high finding for any deployment with regulated or high-value workloads.
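The ECC, persistence, and MIG rules above can be sketched as a small classifier over captured `nvidia-smi` output. This is an illustrative sketch under stated assumptions: the text is assumed to come from `nvidia-smi --query-gpu=index,name,driver_version,persistence_mode,ecc.mode.current,mig.mode.current --format=csv,noheader`, the function names and the workload flags (`training`, `long_running_inference`, `multi_tenant`) are hypothetical, and the workload context has to be supplied by the reviewer because the tool does not report it.

```python
import csv
import io

# Field order must match the --query-gpu list used when capturing the output.
FIELDS = ["index", "name", "driver_version", "persistence_mode",
          "ecc.mode.current", "mig.mode.current"]


def parse_gpu_rows(csv_text):
    """Parse `--format=csv,noheader` output into one dict per GPU."""
    rows = []
    for raw in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not raw:
            continue  # skip blank lines
        rows.append(dict(zip(FIELDS, (value.strip() for value in raw))))
    return rows


def gpu_findings(gpu, training=False, long_running_inference=False,
                 multi_tenant=False):
    """Return (severity, message) pairs for one GPU, following the rules above."""
    findings = []
    if training and gpu["ecc.mode.current"] == "Disabled":
        findings.append(("high", "ECC disabled on a training GPU"))
    if long_running_inference and gpu["persistence_mode"] == "Disabled":
        findings.append(("medium", "persistence mode disabled on an inference host"))
    # "Disabled" means MIG-capable but unpartitioned; "[N/A]" means not MIG-capable.
    if multi_tenant and gpu["mig.mode.current"] == "Disabled":
        findings.append(("medium", "MIG-capable GPU in whole-GPU mode in a multi-tenant cluster"))
    return findings
```

Keeping the workload context as explicit arguments makes the inference visible: a row with ECC disabled is only a high finding once the reviewer has confirmed the GPU serves training workloads.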

References

Load these only when needed:

  • NVIDIA AI Enterprise support matrix
  • DGX/HGX system user guides for the deployed generation
  • NCA-AIIO and NCP-AII exam blueprints

Response minimum

Return, at minimum:

  • the scoped target (host class, generation, AI Enterprise version) and evidence level,
  • driver / CUDA / cuDNN / firmware posture vs the support matrix,
  • BMC / OOB segmentation posture,
  • ECC / persistence / MIG posture per GPU,
  • the safest next actions and any assumptions or blockers.
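One way to keep a review from omitting a required element is to hold the response in a small structure and check it before replying. The sketch below is only an illustration: the class name `ReviewResponse`, its field names, and the helper `missing_sections` are hypothetical and do not come from the skill. Assumptions and blockers are treated as optional because a review may legitimately have none.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewResponse:
    # Scoped target and evidence level
    host_class: str = ""
    generation: str = ""
    ai_enterprise_version: str = ""
    evidence_level: str = ""          # e.g. live query vs documentation vs inference
    # Posture sections
    version_posture: str = ""         # driver / CUDA / cuDNN / firmware vs support matrix
    bmc_posture: str = ""             # BMC / OOB segmentation
    gpu_posture: dict = field(default_factory=dict)   # GPU index -> ECC / persistence / MIG notes
    # Follow-up
    next_actions: list = field(default_factory=list)
    assumptions_or_blockers: list = field(default_factory=list)


# Sections that must be filled in; assumptions_or_blockers may be empty.
REQUIRED = ("host_class", "generation", "ai_enterprise_version", "evidence_level",
            "version_posture", "bmc_posture", "gpu_posture", "next_actions")


def missing_sections(resp):
    """Return the names of required sections that are still empty."""
    return [name for name in REQUIRED if not getattr(resp, name)]
```

An empty list from `missing_sections` means every minimum element has some content; it says nothing about whether that content is backed by live evidence, which the `evidence_level` field is meant to record.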

Install via CLI

npx skills add https://github.com/Raishin/vanguard-frontier-agentic --skill nvidia-ai-infrastructure-operations

Repository Details

  • Stars: 18
  • Forks: 2
  • Branch: main
  • Path: SKILL.md