nomad-scheduler - SKILL.md Agent Skill

name: nomad-scheduler description: HashiCorp Nomad — job scheduling, task drivers, allocations, scaling, federation domain: devops tags:

ci-cd
devops
infrastructure
nomad
scheduler

Overview

Nomad is a flexible workload orchestrator for deploying containers, VMs, and standalone applications. Supports multiple task drivers, bin packing, and multi-region federation.

Capabilities

Multi-driver task scheduling (Docker, exec, Java, QEMU)
Job specification in HCL
Rolling updates and canary deployments
Multi-region federation
Autoscaling with external metrics
Integration with Consul and Vault

When to Use

Mixed workloads (containers + non-containerized)
Simple alternative to Kubernetes
Multi-region deployments
Batch job scheduling
Integration with HashiCorp stack (Consul, Vault)

When NOT to Use

Task is outside your authorization scope
You need to implement controls (use implementing-* skills)
Task is about analysis, not action (use analyzing-* skills)
You don't have access to target systems
Task requires compliance expertise (consult professionals)
Task is about defense, not offense (use defensive skills)

Pseudo Code

The nomad-scheduler workflow follows a standard pipeline pattern.

Core flow:

# nomad-scheduler primary flow
input = prepare(raw_data)
result = process(input, config={allocations, drivers, federation, hashicorp, nomad})
validate(result)
deliver(result)

Error handling:

on error:
  log(error_details)
  retry_with_backoff(max=3)
  if still_failing: alert_and_escalate()

Job Specification

# web.nomad
job "web" {
  datacenters = ["dc1"]
  type        = "service"

  group "app" {
    count = 3

    network {
      port "http" { to = 8080 }
    }

    service {
      name     = "web"
      port     = "http"
      provider = "consul"

      check {
        type     = "http"
        path     = "/health"
        interval = "10s"
      }
    }

    task "server" {
      driver = "docker"

      config {
        image = "nginx:alpine"
        ports = ["http"]
      }

      resources {
        cpu    = 256
        memory = 256
      }

      template {
        data        = "{{ key \"config/nginx.conf\" }}"
        destination = "local/nginx.conf"
      }

      vault {
        policies = ["web-policy"]
      }
    }
  }

  update {
    max_parallel = 1
    canary       = 1
    auto_revert  = true
  }
}

Commands

# Run job
nomad job run web.nomad

# Status
nomad job status web
nomad alloc status <alloc-id>

# Scale
nomad job scale web app 5

# Stop
nomad job stop web

# Plan (dry run)
nomad job plan web.nomad

# Regions
nomad server members
nomad job run -region=eu web.nomad

Common Patterns

Periodic jobs: cron schedule for batch work
Parameterized jobs: trigger with meta payload
Canary deploys: canary = 1 for safe rollouts
Spread: distribute across datacenters
Affinity/Constraints: pin to specific nodes

How to Use

Define infrastructure as code (Terraform, CloudFormation, Pulumi)
Review changes through PR process before applying
Configure monitoring and alerting for critical paths
Set up secrets management (Vault, AWS Secrets Manager, etc.)
Document runbooks for deployment, rollback, and incident response
Test disaster recovery procedures regularly

Red Flags

Infrastructure changes without review: Unreviewed changes cause outages — use PRs for infra code
No rollback strategy: Every deployment needs a tested rollback plan before it runs
Secrets in configuration files: Secrets in YAML/JSON get committed to version control
Missing monitoring and alerting: Without monitoring, outages go undetected until users report them
No documentation for runbooks: Without runbooks, on-call engineers waste time re-discovering procedures

Verification

Skill output matches expected behavior

Process

Analyze the task requirements
Apply domain expertise
Verify output quality