dh-troubleshoot - SKILL.md Agent Skill

name: dh-troubleshoot type: on-demand trigger: - /dh:troubleshoot description: Systematic debugging for infrastructure issues. Use when user reports errors, app not working, 502/504 errors, certificate problems, DNS issues, or says "debug", "not working", "help", "broken", "error".

Dockhand Troubleshoot

Systematic debugging guide for infrastructure issues. Supports quick triage for common problems and deep diagnostic chains for complex issues.

Quick Triage Mode

For common errors, run targeted diagnostics:

HTTP 502/504 (Bad Gateway)

1. traefik_status → Check if router exists for domain
2. dokploy_app_status → Verify app is running
3. ssh_exec "docker logs <container> --tail 50" → Check app logs

Common fixes:

Container crashed → dokploy_redeploy
Wrong port → Update Traefik config
App startup failure → Check env vars

Certificate Errors (HTTPS)

1. traefik_check_certs domain="<domain>" → Check cert status
2. dns_get_record name="<domain>" → Verify DNS points to Traefik
3. ssh_exec "docker logs traefik --tail 100 | grep acme" → ACME logs

Common fixes:

DNS not propagated → Wait, check dns_check_propagation
Rate limited → Wait 1 hour, use staging CA
Wrong challenge type → Verify HTTP-01 port 80 open

Cloudflare 521 (Origin Down)

1. ssh_exec "curl -I localhost:<port>" on platform-core → Test local
2. traefik_status → Verify Traefik running
3. ssh_exec "docker ps | grep traefik" → Container status

Common fixes:

Traefik not running → Restart via Dokploy
Firewall blocking → Check Hetzner firewall rules
Network misconfiguration → Verify dokploy-network

App Not Starting

1. dokploy_app_status → Get deployment status
2. ssh_exec "docker logs <container> --tail 100" → Startup logs
3. ssh_exec "docker inspect <container>" → Check config

Common fixes:

Missing env var → Add via Dokploy UI
Volume permission → Fix ownership
Resource exhaustion → Check check_resource_thresholds

Deep Diagnostic Chain

For complex issues, run full diagnostic sequence:

Step 1: Infrastructure Health

check_resource_thresholds → All hosts CPU/memory/disk
monitoring_health → Prometheus/Grafana/Loki status

Step 2: Edge Layer (Traefik)

traefik_status → All routers and services
traefik_check_certs → All certificate status
ssh_exec "docker logs traefik --tail 200" → Recent logs

Step 3: Orchestration Layer (Dokploy)

dokploy_list_apps → All managed applications
dokploy_app_status app_id=<id> → Specific app details

Step 4: Container Layer (Docker)

docker_state host="<host>" → Full container inventory
ssh_exec "docker ps -a" → Include stopped containers
ssh_exec "docker logs <container>" → Application logs

Step 5: Network Layer

ssh_exec "docker network ls" → Network inventory
ssh_exec "docker network inspect dokploy-network" → Connectivity
dns_list_records → DNS configuration
dns_check_propagation domain="<domain>" → External resolution

Step 6: Application Layer

ssh_exec "curl -I http://localhost:<port>" → Local health
ssh_exec "curl -I https://<domain>" → External health

Diagnostic Output Format

Present findings in structured format:

=== Troubleshooting Report ===
Issue: <user-reported problem>
Timestamp: <when started>

## Quick Checks
- [ ] Host connectivity: OK
- [ ] Traefik status: OK
- [ ] App status: FAILED - container restarting
- [ ] DNS resolution: OK
- [ ] Certificate: OK

## Root Cause
Container `ghost` in restart loop due to missing DATABASE_URL

## Recommended Fix
1. Add DATABASE_URL to Dokploy environment variables
2. Redeploy application
3. Verify health post-deploy

## Commands to Execute
dokploy_redeploy app_id="xxx" confirmed=true

Escalation Path

If quick triage doesn't resolve:

Run full diagnostic chain
Check Grafana dashboards for historical data
Review Loki logs for patterns
Check recent changes (git log, Dokploy activity)
Escalate to manual SSH investigation

Integration with dh-deploy-validator

After fixing issues, trigger dh-deploy-validator agent to confirm resolution and prevent regression.