linux-monitoring-setup - SKILL.md Agent Skill

name: linux-monitoring-setup
description: Use when user wants to set up monitoring, observability, alerting, dashboards, uptime checks, metrics collection, or log aggregation for a Linux server, Docker stack, application, or infrastructure — including Prometheus, Grafana, Node Exporter, Loki, Alertmanager, Uptime Kuma, Netdata, or custom bash-based monitoring scripts.
version: 1.5.0
author: Lehnert

Linux Monitoring Setup

Overview

Designs and generates a complete monitoring stack tailored to the user's infrastructure and needs. Presents real stack options, lets the user choose, then writes all config files and a ready-to-run docker-compose.yml to disk. Covers metrics, logs, alerts, dashboards, and uptime monitoring.

Language: Respond in the user's language. All config files and comments in English.

When to Use

User wants to monitor a server, VPS, or Docker stack
User asks about Prometheus, Grafana, Node Exporter, Loki, Alertmanager
User wants uptime monitoring or a status page
User wants log aggregation and search (ELK, Loki)
User wants alerts for CPU, disk, memory, service down, HTTP errors
User wants lightweight monitoring without heavy tools

When NOT to Use

User wants to analyze existing logs → linux-log-analyzer
User wants to harden the server → linux-security-hardener
User wants a general Docker stack → docker-compose-writer

Step 1 — Understand the Context

Ask at most two questions if the answers aren't clear from the request.

Question 1: "What do you want to monitor?" — offer examples:

A single Linux server (CPU, RAM, disk, network)
A Docker host and its containers
A web application (HTTP response time, error rates, uptime)
Databases (PostgreSQL, MySQL, Redis)
Everything — full-stack observability

Question 2: "How much infrastructure do you want to run for monitoring?"

| Tier | RAM | Containers | Best for |
|------|-----|------------|----------|
| Lightweight | ~100–150 MB | 1–2 | Low-resource VPS, quick visibility |
| Standard | ~400–600 MB | 3–4 | Most servers — Prometheus + Grafana |
| Full | ~1–2 GB | 6–8 | Prod infra — metrics + logs + alerts |

Lightweight — Netdata or Uptime Kuma or a bash monitoring script
Standard — Prometheus + Grafana + Node Exporter (industry standard)
Full — metrics + logs + alerts (Prometheus + Grafana + Loki + Promtail + Alertmanager)

If the user just says "set up monitoring for my server", default to Standard — Prometheus + Grafana for a single Linux server.

Step 2 — Present Stack Options

Show options based on the user's context. Always recommend the best fit first.

For a single Linux server (Standard)

Here are the best monitoring stacks for a Linux server. Which fits your needs?

1. Prometheus + Grafana + Node Exporter (Recommended) Industry-standard metrics stack. Node Exporter collects system metrics, Prometheus stores them, Grafana visualizes with pre-built dashboards. ~400MB RAM.

2. Netdata Zero-config, real-time monitoring with a beautiful built-in UI. 1-minute install, very low overhead. Best for quick visibility without configuration. ~150MB RAM.

3. Uptime Kuma Lightweight uptime and status page monitor. Checks HTTP/TCP/ping endpoints and sends alerts (Telegram, Slack, email). ~100MB RAM. Best for service availability.

4. Bash-based monitoring script No containers, no dependencies. A cron-driven bash script that checks CPU, disk, memory, and services — emails or logs alerts. Zero overhead.

For a Docker host

Present: Prometheus + cAdvisor + Grafana (container metrics) or Dozzle (log viewer) or Portainer (full management).

For full-stack observability (Full)

Present: Prometheus + Grafana + Loki + Promtail + Alertmanager (the canonical full stack).

For web application uptime only

Present: Uptime Kuma or Gatus (config-file-driven, more powerful).
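
To make the "config-file-driven" point concrete, here is a minimal sketch of a Gatus configuration. The endpoint name, URL, and thresholds are placeholders chosen for illustration, not values from this skill; adjust them to the user's services.

```yaml
# Gatus config sketch (endpoint name, URL, and thresholds are illustrative assumptions)
endpoints:
  - name: website
    url: "https://example.org/health"   # placeholder URL
    interval: 60s
    conditions:
      - "[STATUS] == 200"               # HTTP status must be 200
      - "[RESPONSE_TIME] < 500"         # response time in milliseconds
```

Because every check lives in one file, the checks can be reviewed and versioned alongside the rest of the stack.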

Stack Specifications

Stack A — Prometheus + Grafana + Node Exporter (Standard)

Components:

prometheus — scrapes and stores time-series metrics
node_exporter — exposes Linux system metrics on port 9100
grafana — visualization with pre-built Linux dashboards

docker-compose.yml pattern:

services:
  prometheus:
    image: prom/prometheus:v2.51.0
    restart: unless-stopped
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    networks:
      - monitoring
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:9090/-/healthy"]
      interval: 30s
      timeout: 10s
      retries: 3

  node_exporter:
    image: prom/node-exporter:v1.8.0
    restart: unless-stopped
    pid: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:10.4.0
    restart: unless-stopped
    environment:
      GF_SECURITY_ADMIN_USER: ${GRAFANA_USER:-admin}
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_INSTALL_PLUGINS: grafana-clock-panel
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    ports:
      - "3000:3000"
    networks:
      - monitoring
    depends_on:
      prometheus:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:3000/api/health"]
      interval: 30s
      timeout: 10s
      retries: 3

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
  grafana_data:

prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s      # Must not exceed scrape_interval; raise both together if exporters are slow

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node_exporter:9100']

  # Add application-specific scrape targets below:
  # - job_name: 'myapp'
  #   static_configs:
  #     - targets: ['myapp:8080']   # must expose /metrics endpoint

Grafana dashboard IDs to import:

1860 — Node Exporter Full (most popular Linux dashboard)
893 — Docker and system monitoring
11074 — Node Exporter for Prometheus Dashboard
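
The compose file mounts ./grafana/provisioning into the container, so Grafana can pick up its data source and dashboard provider at startup. The two files below are a minimal sketch, assuming Prometheus is reachable as prometheus:9090 on the monitoring network and that any dashboard JSON files are placed in the provisioning/dashboards directory; the data source and provider names are illustrative.

```yaml
# grafana/provisioning/datasources/prometheus.yml (sketch)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090   # service name from docker-compose.yml
    isDefault: true
```

```yaml
# grafana/provisioning/dashboards/dashboard.yml (sketch)
apiVersion: 1
providers:
  - name: default
    type: file
    disableDeletion: false
    options:
      path: /etc/grafana/provisioning/dashboards   # JSON dashboards in this directory are loaded
```

Dashboards imported by ID through the UI are stored in Grafana's own database and do not need the file provider.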

Stack B — Full Stack (Prometheus + Grafana + Loki + Promtail + Alertmanager)

Add to Stack A:

Loki — log aggregation, stores logs from Promtail
Promtail — log shipper, reads from /var/log and Docker containers
Alertmanager — routes Prometheus alerts to Slack, email, PagerDuty

  loki:
    image: grafana/loki:2.9.5
    restart: unless-stopped
    command: -config.file=/etc/loki/local-config.yaml
    volumes:
      - ./loki/loki-config.yaml:/etc/loki/local-config.yaml:ro
      - loki_data:/loki
    networks:
      - monitoring

  promtail:
    image: grafana/promtail:2.9.5
    restart: unless-stopped
    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - ./promtail/promtail-config.yaml:/etc/promtail/config.yaml:ro
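    # -config.expand-env=true is required for ${HOSTNAME} in the Promtail config to be substituted
    command: -config.file=/etc/promtail/config.yaml -config.expand-env=true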
    networks:
      - monitoring
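
The Loki service above mounts ./loki/loki-config.yaml, so that file has to exist before the stack starts. The following is a minimal single-node sketch, assuming filesystem storage under the /loki volume and no authentication. It is modeled on the local configuration example for Loki 2.9 and should be checked against the Loki documentation for the pinned image version.

```yaml
# loki/loki-config.yaml (sketch: single node, filesystem storage, auth disabled)
auth_enabled: false

server:
  http_listen_port: 3100

common:
  instance_addr: 127.0.0.1
  path_prefix: /loki              # matches the loki_data volume mount
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
```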

Always generate `promtail/promtail-config.yaml` when Promtail is included:

```yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          host: ${HOSTNAME}
          __path__: /var/log/*.log

  - job_name: docker
    static_configs:
      - targets:
          - localhost
        labels:
          job: docker
          host: ${HOSTNAME}
          __path__: /var/lib/docker/containers/*/*log
    pipeline_stages:
      - json:
          expressions:
            log: log
            stream: stream
            time: time
      - labels:
          stream:
      - output:
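          source: log
```

Add the Alertmanager service to docker-compose.yml, and declare loki_data and alertmanager_data under the top-level volumes key:

  alertmanager:
    image: prom/alertmanager:v0.27.0
    restart: unless-stopped
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    ports:
      - "9093:9093"
    networks:
      - monitoring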


---

### Stack C — Uptime Kuma (Lightweight uptime monitoring)

```yaml
services:
  uptime-kuma:
    image: louislam/uptime-kuma:1
    restart: unless-stopped
    volumes:
      - uptime_kuma_data:/app/data
    ports:
      - "3001:3001"
    healthcheck:
      test: ["CMD", "node", "extra/healthcheck.js"]
      interval: 30s
      timeout: 10s
      retries: 3

volumes:
  uptime_kuma_data:
```

Stack D — Netdata (Zero-config real-time monitoring)

services:
  netdata:
    image: netdata/netdata:stable
    restart: unless-stopped
    pid: host
    cap_add:
      - SYS_PTRACE
      - SYS_ADMIN
    security_opt:
      - apparmor:unconfined
    volumes:
      - netdata_config:/etc/netdata
      - netdata_lib:/var/lib/netdata
      - netdata_cache:/var/cache/netdata
      - /etc/passwd:/host/etc/passwd:ro
      - /etc/group:/host/etc/group:ro
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /etc/os-release:/host/etc/os-release:ro
    environment:
      NETDATA_CLAIM_TOKEN: ${NETDATA_CLAIM_TOKEN:-}
      NETDATA_CLAIM_URL: https://app.netdata.cloud
    ports:
      - "19999:19999"

volumes:
  netdata_config:
  netdata_lib:
  netdata_cache:

Stack E — Docker host monitoring (Prometheus + cAdvisor + Node Exporter + Grafana)

Add cAdvisor to Stack A for container metrics:

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    restart: unless-stopped
    privileged: true
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
    networks:
      - monitoring

Add to prometheus.yml:

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

Alert Rules

When generating the full stack, always include a prometheus/alert.rules.yml with these critical alerts:

groups:
  - name: critical
    rules:
      - alert: HostHighCpuLoad
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load on {{ $labels.instance }}"

      - alert: HostLowDisk
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}: {{ $value | printf \"%.1f\" }}% free"

      - alert: HostOutOfMemory
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Out of memory on {{ $labels.instance }}"

      - alert: HostDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} is down"

      - alert: HostHighLoad
        expr: node_load15 > 4
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High system load on {{ $labels.instance }}"
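
The rules file only takes effect once Prometheus loads it and knows where to send firing alerts. A sketch of the required additions, assuming the rules file is mounted at /etc/prometheus/alert.rules.yml and Alertmanager is reachable as alertmanager:9093 on the monitoring network:

```yaml
# prometheus/prometheus.yml (additions at the top level)
rule_files:
  - /etc/prometheus/alert.rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
```

```yaml
# docker-compose.yml (additional entry under the prometheus service's volumes)
    volumes:
      - ./prometheus/alert.rules.yml:/etc/prometheus/alert.rules.yml:ro
```

Without the extra mount, the container sees only prometheus.yml and Prometheus fails to find the rules file.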

Alertmanager Configuration

When Alertmanager is included, always generate alertmanager/alertmanager.yml:

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'instance']
  group_wait: 30s          # Wait to batch related alerts before sending
  group_interval: 5m       # Wait before notifying about new alerts added to an already-notified group
  repeat_interval: 4h      # Repeat firing alerts every 4 hours

  receiver: 'default'

  routes:
    - matchers:
        - severity="critical"
      receiver: 'critical'
      repeat_interval: 1h   # Critical alerts: repeat every hour

receivers:
  - name: 'default'
    # Replace with your notification method:
    # slack_configs, email_configs, pagerduty_configs
    # See: https://prometheus.io/docs/alerting/configuration/

  - name: 'critical'
    # Same — configure separately for critical escalation

inhibit_rules:
  # Suppress warnings when a critical alert for the same instance is already firing
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['instance']

Alert fatigue prevention: The inhibit_rules section suppresses warning-level alerts when a critical alert for the same instance is already firing. Without this, a disk-full event can trigger 10+ separate alerts simultaneously.
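
The receivers in the template are left empty on purpose. As one way to fill them in, here is a sketch of a Slack receiver; the webhook URL and channel name are placeholders, and email_configs or pagerduty_configs can be used in the same position.

```yaml
# alertmanager/alertmanager.yml (receivers section, Slack variant; values are placeholders)
receivers:
  - name: 'default'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'   # placeholder webhook URL
        channel: '#alerts'                                        # placeholder channel
        send_resolved: true                                       # also notify when the alert clears

  - name: 'critical'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'   # placeholder webhook URL
        channel: '#alerts-critical'                               # placeholder channel
        send_resolved: true
```

Keeping the webhook URL out of version control (for example by templating it from .env at generation time) is a design choice left to the user.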

Output Format

Write all files silently to ./monitoring/ in the current working directory.

Structure:

monitoring/
  docker-compose.yml
  .env.example
  prometheus/
    prometheus.yml
    alert.rules.yml          (if alerting configured)
  grafana/
    provisioning/
      datasources/
        prometheus.yml
      dashboards/
        dashboard.yml
  loki/
    loki-config.yaml         (if Loki included)
  promtail/
    promtail-config.yaml     (if Promtail included)
  alertmanager/
    alertmanager.yml         (if Alertmanager included)

Then print ONLY:

✅ Monitoring stack created in ./monitoring/

▶ To start:
  cd monitoring
  cp .env.example .env && nano .env
  docker compose up -d

▶ Verify everything is running:
  docker compose ps
  docker compose logs -f

🌐 Grafana:      http://localhost:3000  (admin / your-password)
🌐 Prometheus:   http://localhost:9090
🌐 Alertmanager: http://localhost:9093  (if configured)

📊 Import dashboards in Grafana:
   Dashboards → Import → Enter ID: 1860 (Node Exporter Full)
   Dashboards → Import → Enter ID: 893  (Docker monitoring)

💡 Next: /linux-security-hardener to lock down the monitoring ports before exposing them to the internet

Adjust URLs to match the actual services generated.