name: k8s-diagnostic
description: diagnostic routines for troubleshooting k8s home lab issues
K8s Diagnostic Routine
Use this skill to diagnose common failures in a Kubernetes home lab environment.
Diagnostic Steps
- Check Pod Status:
    kubectl get pods -n <namespace>
- Describe for Events: Use this for CrashLoopBackOff or Pending pods.
    kubectl describe pod <pod-name> -n <namespace>
- Inspect Logs: Check both current and previous logs if a pod is crashing.
    kubectl logs <pod-name> -n <namespace> --tail=50
    kubectl logs <pod-name> -n <namespace> --previous
- Node Affinity & Resources: For Pending pods, check node resources and affinities (especially for GPU/NVIDIA nodes).
    kubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:.status.allocatable."nvidia\.com/gpu"
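The first three checks above can be chained into one helper for a single pod. This is a minimal sketch: `triage_pod` is a hypothetical function name, and it assumes `kubectl` is already pointed at the right cluster context.

```shell
# Sketch: run the status, events, and log checks for one pod, in order.
# triage_pod is a hypothetical helper name, not part of kubectl.
triage_pod() {
  ns="$1"
  pod="$2"
  echo "== status =="
  kubectl get pod "$pod" -n "$ns" -o wide
  echo "== events =="
  kubectl describe pod "$pod" -n "$ns"
  echo "== current logs (last 50 lines) =="
  kubectl logs "$pod" -n "$ns" --tail=50
  echo "== previous container logs =="
  # --previous fails when the container has never restarted; that is not an error here.
  kubectl logs "$pod" -n "$ns" --previous --tail=50 || echo "(no previous container)"
}
```

Usage: `triage_pod <namespace> <pod-name>`.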
Advanced Observability
If pod-level diagnostics are insufficient, use the cluster's observability stack:
Prometheus (Metrics & Querying)
- Access: Available on the local network via its HTTPRoute hostname.
- Proxy: If the hostname is unreachable, use a port-forward:
    kubectl port-forward -n monitoring svc/prometheus-k8s 9090
- Usage: Query for resource pressure, networking issues, or specific application metrics to identify trends or correlates for failures.
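With the port-forward running, Prometheus can also be queried from the command line through its HTTP API. The sketch below makes several assumptions: Prometheus answers on `localhost:9090`, cAdvisor container metrics are being scraped, and `prom_query` and `top_memory_query` are hypothetical helper names.

```shell
# prom_query is a hypothetical helper. It assumes the port-forward is running,
# so Prometheus answers on localhost:9090.
prom_query() {
  # -G sends the URL-encoded data as a GET query string.
  curl -sG "http://localhost:9090/api/v1/query" --data-urlencode "query=$1"
}

# Example PromQL (assumes cAdvisor metrics are scraped):
# the five pods with the largest memory working set in one namespace.
top_memory_query() {
  printf 'topk(5, sum by (pod) (container_memory_working_set_bytes{namespace="%s"}))' "$1"
}
```

Usage: `prom_query "$(top_memory_query <namespace>)"` returns the result as JSON.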
Loki (Log Aggregation)
- Usage: Use Loki when kubectl logs (even with --previous) is insufficient or when you need to correlate logs across multiple replicas or services.
- Access: Typically accessed via the Grafana dashboard.
Common Failure Patterns
NFS Provisioning Failure (mount.nfs: Cannot allocate memory):
- Symptoms: Pods stuck in ContainerCreating or PVCs stuck in Pending with events showing "failed to provision volume ... mount failed: exit status 32 ... Cannot allocate memory".
- Rectification: Restart the NFS server deployment:
    kubectl rollout restart deploy -n nfs-server
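After the restart it helps to wait for the rollout to finish and then confirm that no claims are still stuck. A minimal sketch, assuming the deployment is named `nfs-server` (as in `deploy/nfs-server`) and that two minutes is long enough for it to become ready; `restart_nfs_server` is a hypothetical helper name.

```shell
# restart_nfs_server is a hypothetical helper name.
restart_nfs_server() {
  kubectl rollout restart deploy -n nfs-server || return 1
  # Block until the new pod is ready. The 120s timeout is an arbitrary choice.
  kubectl rollout status deploy/nfs-server -n nfs-server --timeout=120s || return 1
  # Any claim still Pending after the restart needs a closer look.
  kubectl get pvc -A | grep Pending || echo "no Pending PVCs"
}
```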
NFS Mount Timeout (MountVolume.SetUp failed ... rpc error: code = Internal desc = time out):
- Symptoms: Pods stuck in ContainerCreating with FailedMount events showing an RPC timeout. New mount commands from the CSI node plugin hang indefinitely.
- Key info: The NFS server runs in namespace nfs-server. The CSI NFS driver pods (csi-nfs-node-*, csi-nfs-controller-*) run in kube-system. The StorageClass is crobasaurusrex-nfs, using provisioner nfs.csi.k8s.io with nfsvers=4.1.
- Diagnosis (escalate through these steps):
  - Check if the NFS server pod is running:
      kubectl get pods -n nfs-server
  - Verify NFS exports are working inside the server (NFSv4 does NOT need mountd, so showmount -e returning "Program not registered" is normal):
      kubectl exec -n nfs-server deploy/nfs-server -- exportfs -v
      kubectl exec -n nfs-server deploy/nfs-server -- rpcinfo -p localhost
    Look for program 100003 (nfs) in the rpcinfo output. Mountd (100005) NOT being registered is expected for NFSv4-only.
  - Check if the CSI NFS node plugin on the affected node can actually mount:
      # Find the CSI node plugin pod on the affected node
      kubectl get pods -n kube-system -l app=csi-nfs-node -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName
      # Test a manual mount from inside it (exit code 124 = timeout)
      kubectl exec -n kube-system <csi-nfs-node-pod> -c nfs -- sh -c "mkdir -p /tmp/nfs-test && timeout 10 mount -t nfs4 -o nfsvers=4.1 nfs-server.nfs-server.svc.cluster.local:/ /tmp/nfs-test && echo SUCCESS && umount /tmp/nfs-test"
  - If the manual mount times out, check for stale kernel NFS state on the node:
      kubectl exec -n kube-system <csi-nfs-node-pod> -c nfs -- cat /proc/fs/nfsfs/servers
      # Filter inside the container so the command works from any local shell
      kubectl exec -n kube-system <csi-nfs-node-pod> -c nfs -- sh -c "grep nfs /proc/mounts"
- Root cause: When the NFS server pod restarts, existing kernel NFS client connections on nodes become stale. The Linux kernel NFS client holds onto the dead session and blocks ALL new NFS mounts on that node, not just the original ones.
- Rectification (in order of escalation):
  - First try restarting the NFS server:
      kubectl rollout restart deploy -n nfs-server
  - Then restart the CSI NFS node plugin on the affected node:
      kubectl delete pod -n kube-system <csi-nfs-node-pod>
  - Delete and recreate the affected pod to get a fresh mount attempt.
  - If all of the above fail (stale kernel NFS state): reboot the affected node. This is the only reliable way to clear stale kernel NFS client state. The kernel NFS client does not recover on its own once stuck.
- Prevention (Soft Mounts): To prevent this kernel deadlock on future nfs-server restarts, the cluster's NFS StorageClass has been updated to use soft mounts with a shorter timeout (timeo=50; the value is in tenths of a second, so 5 seconds per attempt). Instead of hanging permanently (the default behavior of hard mounts), the kernel will abort broken connections and return I/O errors, which keeps the phantom session from locking up the node's mount table.
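The diagnosis, rectification, and prevention steps for the mount timeout can be wrapped in a few small helpers. This is a sketch under stated assumptions, not a definitive tool: every function name is hypothetical, the CSI node plugin pods are assumed to be managed by a DaemonSet (so a deleted pod is recreated), and exit status 32 is taken to be the generic mount failure status seen in the provisioning error.

```shell
# All function names below are hypothetical helpers, not part of kubectl.

# Map the exit status of the manual mount test to a next step.
explain_mount_exit() {
  case "$1" in
    0)   echo "mount OK: NFS is reachable from this node" ;;
    124) echo "timeout: check the node for stale kernel NFS state" ;;
    32)  echo "mount failure: check the server pod and its exports" ;;
    *)   echo "unexpected exit status $1: read the command output" ;;
  esac
}

# Find the CSI NFS node plugin pod scheduled on a given node.
csi_nfs_pod_on_node() {
  kubectl get pods -n kube-system -l app=csi-nfs-node \
    --field-selector "spec.nodeName=$1" -o name
}

# Delete that pod so it is recreated (assumes it is DaemonSet-managed).
restart_csi_nfs_on_node() {
  pod=$(csi_nfs_pod_on_node "$1")
  [ -n "$pod" ] || { echo "no csi-nfs-node pod found on $1" >&2; return 1; }
  kubectl delete -n kube-system "$pod"
}

# Succeed only if the StorageClass lists the soft mount option.
storageclass_is_soft() {
  kubectl get storageclass crobasaurusrex-nfs -o jsonpath='{.mountOptions[*]}' |
    tr ' ' '\n' | grep -qx soft
}
```

Usage: run the manual mount test, then call `explain_mount_exit $?` right after it. `restart_csi_nfs_on_node <node-name>` covers the second rectification step, and `storageclass_is_soft && echo "soft mounts configured"` confirms the prevention setting.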
ImagePullBackOff: Verify the tag exists and Renovate bot hasn't pushed a non-existent version.
OOMKilled: Compare kubectl top pod results with the resources.limits defined in the manifest.
NVIDIA GPU Plugin Failure (failed to initialize NVML: Driver/library version mismatch):
- Symptoms: Pods like nvidia-device-plugin-daemonset crashlooping, or GPU-dependent pods stuck in Pending due to missing GPU capacity. Pod logs show "failed to initialize NVML: Driver/library version mismatch" or "OCI runtime create failed".
- Root Cause: The host node's NVIDIA GPU drivers/libraries were upgraded (e.g., via host updates), but the running kernel has not loaded the new module version, creating a mismatch with the user-space libraries.
- Rectification: Reboot the affected GPU node (e.g. croblodocus) to load the matching kernel module and sync it with the libraries.
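Once the node is back, one way to confirm the fix is to check that it advertises GPU capacity again. A minimal sketch: `gpu_allocatable` and `gpu_ready` are hypothetical helper names, and the check assumes the device plugin exposes the `nvidia.com/gpu` resource.

```shell
# Print the allocatable GPU count for one node (empty if none is advertised).
# The backslash escapes the dot inside the resource name for jsonpath.
gpu_allocatable() {
  kubectl get node "$1" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}

# Succeed only if the node advertises at least one GPU.
gpu_ready() {
  count=$(gpu_allocatable "$1")
  [ -n "$count" ] && [ "$count" -gt 0 ] 2>/dev/null
}
```

Usage: `gpu_ready croblodocus && echo "GPU capacity is back"`.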