name: gotcha-systemd-restart-loop-bombs
description: USE WHEN adding Restart=on-failure to a systemd unit, or debugging cascading service outages after nixos-rebuild switch — a unit whose ExecStart fails fast respawns every RestartSec; cumulative pressure (FDs, GPU memory, scheduler) breaks the NEXT switch-to-configuration. Smoke-test ExecStart manually FIRST. StartLimitBurst doesn't protect against steady-state loops with restart intervals > window÷burst.
systemd Restart=on-failure units are restart-loop bombs if ExecStart is wrong
A unit with Restart=on-failure + an ExecStart that fails fast (e.g. invalid CLI flag) will respawn every RestartSec seconds indefinitely (or until StartLimitBurst saves you). Each respawn pays the full cold-start cost of whatever the binary needs — for bwrap-wrapped tools that's a fresh namespace, a fresh nix-shell setup, and a fresh process tree. Cumulatively, this can drain file descriptors, GPU memory, and scheduler capacity badly enough to break the NEXT nixos-rebuild switch.
Hit 2026-06-03 with systemd.user.services.claude-remote-control: ExecStart passed --dangerously-skip-permissions to claude remote-control (server-mode subcommand rejects that flag — it's only valid for interactive claude). The unit looped at 10s intervals for 5+ minutes. When nh os switch ran, the system was under enough pressure that:
caddy.servicehitStopped with result 'timeout'(couldn't shut down in the systemd timeout window)wayland-wm@hyprland.servicesame- NVIDIA GPU returned
NV_ERR_NO_MEMORYduring the user-session re-init - waybar logged
Too many open files
When systemd force-kills units it considers stop-failed, switch-to-configuration treats them as "in transition" and does not move to the start phase for them. Net result: ~30 services stopped, never restarted. Including Blocky (DNS), Caddy (reverse proxy), Authelia (SSO), Immich, Jellyfin, qBittorrent, Vaultwarden, the *arr stack, Ollama, ntfy-publishers — the entire user-facing surface.
The fix is structural, in the unit definition:
- Smoke-test
ExecStartBEFORE landing a unit withRestart=on-failure. Manually invoke the exact command string (env, args, working dir) and confirm it succeeds — especially for sandboxed wrappers (claude-box, opencode-box) where the bwrap cold start hides quick failures behind a 50-300ms latency. - Cap restart loops with
StartLimitIntervalSec+StartLimitBurstso an unverified unit can't loop forever. The claude-remote-control unit hadStartLimitBurst=5in 300s — but the loop kept running because each restart was scheduled OUTSIDE the burst window (10s interval × 5 = 50s, well under 300s); systemd resets the counter when 300s elapses without a burst. Lesson:StartLimitBurstonly protects against bursts, not steady-state loops with restart intervals > the window divided by burst count. For a steady-state cap, raiseRestartSecenough that the loop counts against the burst budget, or setRestart=on-failure+RestartMaxDelaySec=to back-off exponentially. - Prefer
Restart=nowhile testing a new unit, then flip toRestart=on-failureonce you've confirmed the happy path.
Diagnostics:
journalctl -u <unit> --since '1 hour ago' --no-pager | grep "Failed with result 'exit-code'"— count restarts per minute to spot loops.journalctl -b -p err --no-pager | grep -E "timeout|NV_ERR|EMFILE|Too many open files"— look for system-pressure symptoms in the activation window.systemctl status <unit>—Restart counter is at Nfield shows how many cycles have happened.
Trap to avoid: "the unit is just printing a usage error, that's harmless" — it ISN'T harmless when restarted every 10s for hours. The cumulative system pressure breaks unrelated things.