linux-monitoring-setup

star 1

Use when user wants to set up monitoring, observability, alerting, dashboards, uptime checks, metrics collection, or log aggregation for a Linux server, Docker stack, application, or infrastructure — including Prometheus, Grafana, Node Exporter, Loki, Alertmanager, Uptime Kuma, Netdata, or custom bash-based monitoring scripts.

chfle By chfle schedule Updated 3/2/2026

name: linux-monitoring-setup description: Use when user wants to set up monitoring, observability, alerting, dashboards, uptime checks, metrics collection, or log aggregation for a Linux server, Docker stack, application, or infrastructure — including Prometheus, Grafana, Node Exporter, Loki, Alertmanager, Uptime Kuma, Netdata, or custom bash-based monitoring scripts. version: 1.5.0 author: Lehnert

Linux Monitoring Setup

Overview

Designs and generates a complete monitoring stack tailored to the user's infrastructure and needs. Presents real stack options, lets the user choose, then writes all config files and a ready-to-run docker-compose.yml to disk. Covers metrics, logs, alerts, dashboards, and uptime monitoring.

Language: Respond in the user's language. All config files and comments in English.


When to Use

  • User wants to monitor a server, VPS, or Docker stack
  • User asks about Prometheus, Grafana, Node Exporter, Loki, Alertmanager
  • User wants uptime monitoring or a status page
  • User wants log aggregation and search (ELK, Loki)
  • User wants alerts for CPU, disk, memory, service down, HTTP errors
  • User wants lightweight monitoring without heavy tools

When NOT to Use

  • User wants to analyze existing logs → linux-log-analyzer
  • User wants to harden the server → linux-security-hardener
  • User wants a general Docker stack → docker-compose-writer

Step 1 — Understand the Context

Ask at most two questions if the answers aren't clear from the request.

Question 1: "What do you want to monitor?" — offer examples:

  • A single Linux server (CPU, RAM, disk, network)
  • A Docker host and its containers
  • A web application (HTTP response time, error rates, uptime)
  • Databases (PostgreSQL, MySQL, Redis)
  • Everything — full-stack observability

Question 2: "How much infrastructure do you want to run for monitoring?"

Tier RAM Containers Best for
Lightweight ~100–150MB 1–2 Low-resource VPS, quick visibility
Standard ~400–600MB 3–4 Most servers — Prometheus + Grafana
Full ~1–2GB 6–8 Prod infra — metrics + logs + alerts
  • Lightweight — Netdata or Uptime Kuma or a bash monitoring script
  • Standard — Prometheus + Grafana + Node Exporter (industry standard)
  • Full — metrics + logs + traces + alerts (Prometheus + Grafana + Loki + Alertmanager)

If the user just says "set up monitoring for my server", default to Standard — Prometheus + Grafana for a single Linux server.


Step 2 — Present Stack Options

Show options based on the user's context. Always recommend the best fit first.

For a single Linux server (Standard)

Here are the best monitoring stacks for a Linux server. Which fits your needs?

1. Prometheus + Grafana + Node Exporter (Recommended) Industry-standard metrics stack. Node Exporter collects system metrics, Prometheus stores them, Grafana visualizes with pre-built dashboards. ~400MB RAM.

2. Netdata Zero-config, real-time monitoring with a beautiful built-in UI. 1-minute install, very low overhead. Best for quick visibility without configuration. ~150MB RAM.

3. Uptime Kuma Lightweight uptime and status page monitor. Checks HTTP/TCP/ping endpoints and sends alerts (Telegram, Slack, email). ~100MB RAM. Best for service availability.

4. Bash-based monitoring script No containers, no dependencies. A cron-driven bash script that checks CPU, disk, memory, and services — emails or logs alerts. Zero overhead.

For a Docker host

Present: Prometheus + cAdvisor + Grafana (container metrics) or Dozzle (log viewer) or Portainer (full management).

For full-stack observability (Full)

Present: Prometheus + Grafana + Loki + Promtail + Alertmanager (the canonical full stack).

For web application uptime only

Present: Uptime Kuma or Gatus (config-file-driven, more powerful).


Stack Specifications

Stack A — Prometheus + Grafana + Node Exporter (Standard)

Components:

  • prometheus — scrapes and stores time-series metrics
  • node_exporter — exposes Linux system metrics on port 9100
  • grafana — visualization with pre-built Linux dashboards

docker-compose.yml pattern:

services:
  prometheus:
    image: prom/prometheus:v2.51.0
    restart: unless-stopped
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    networks:
      - monitoring
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:9090/-/healthy"]
      interval: 30s
      timeout: 10s
      retries: 3

  node_exporter:
    image: prom/node-exporter:v1.8.0
    restart: unless-stopped
    pid: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:10.4.0
    restart: unless-stopped
    environment:
      GF_SECURITY_ADMIN_USER: ${GRAFANA_USER:-admin}
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_INSTALL_PLUGINS: grafana-clock-panel
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    networks:
      - monitoring
    depends_on:
      prometheus:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:3000/api/health"]
      interval: 30s
      timeout: 10s
      retries: 3

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
  grafana_data:

prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s      # Increase to 30s if exporters are slow (must be < scrape_interval)

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node_exporter:9100']

  # Add application-specific scrape targets below:
  # - job_name: 'myapp'
  #   static_configs:
  #     - targets: ['myapp:8080']   # must expose /metrics endpoint

Grafana dashboard IDs to import:

  • 1860 — Node Exporter Full (most popular Linux dashboard)
  • 893 — Docker and system monitoring
  • 11074 — Node Exporter for Prometheus Dashboard

Stack B — Full Stack (Prometheus + Grafana + Loki + Promtail + Alertmanager)

Add to Stack A:

Loki — log aggregation, stores logs from Promtail Promtail — log shipper, reads from /var/log and Docker containers Alertmanager — routes Prometheus alerts to Slack, email, PagerDuty

  loki:
    image: grafana/loki:2.9.5
    restart: unless-stopped
    command: -config.file=/etc/loki/local-config.yaml
    volumes:
      - ./loki/loki-config.yaml:/etc/loki/local-config.yaml:ro
      - loki_data:/loki
    networks:
      - monitoring

  promtail:
    image: grafana/promtail:2.9.5
    restart: unless-stopped
    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - ./promtail/promtail-config.yaml:/etc/promtail/config.yaml:ro
    command: -config.file=/etc/promtail/config.yaml
    networks:
      - monitoring

Always generate `promtail/promtail-config.yaml` when Promtail is included:

```yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          host: ${HOSTNAME}
          __path__: /var/log/*.log

  - job_name: docker
    static_configs:
      - targets:
          - localhost
        labels:
          job: docker
          host: ${HOSTNAME}
          __path__: /var/lib/docker/containers/*/*log
    pipeline_stages:
      - json:
          expressions:
            log: log
            stream: stream
            time: time
      - labels:
          stream:
      - output:
          source: log

alertmanager: image: prom/alertmanager:v0.27.0 restart: unless-stopped volumes: - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro - alertmanager_data:/alertmanager networks: - monitoring


---

### Stack C — Uptime Kuma (Lightweight uptime monitoring)

```yaml
services:
  uptime-kuma:
    image: louislam/uptime-kuma:1
    restart: unless-stopped
    volumes:
      - uptime_kuma_data:/app/data
    ports:
      - "3001:3001"
    healthcheck:
      test: ["CMD", "node", "extra/healthcheck.js"]
      interval: 30s
      timeout: 10s
      retries: 3

volumes:
  uptime_kuma_data:

Stack D — Netdata (Zero-config real-time monitoring)

services:
  netdata:
    image: netdata/netdata:stable
    restart: unless-stopped
    pid: host
    cap_add:
      - SYS_PTRACE
      - SYS_ADMIN
    security_opt:
      - apparmor:unconfined
    volumes:
      - netdata_config:/etc/netdata
      - netdata_lib:/var/lib/netdata
      - netdata_cache:/var/cache/netdata
      - /etc/passwd:/host/etc/passwd:ro
      - /etc/group:/host/etc/group:ro
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /etc/os-release:/host/etc/os-release:ro
    environment:
      NETDATA_CLAIM_TOKEN: ${NETDATA_CLAIM_TOKEN:-}
      NETDATA_CLAIM_URL: https://app.netdata.cloud
    ports:
      - "19999:19999"

volumes:
  netdata_config:
  netdata_lib:
  netdata_cache:

Stack E — Docker host monitoring (Prometheus + cAdvisor + Node Exporter + Grafana)

Add cAdvisor to Stack A for container metrics:

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    restart: unless-stopped
    privileged: true
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
    networks:
      - monitoring

Add to prometheus.yml:

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

Alert Rules

When generating full stack, always include a prometheus/alert.rules.yml with these critical alerts:

groups:
  - name: critical
    rules:
      - alert: HostHighCpuLoad
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load on {{ $labels.instance }}"

      - alert: HostLowDisk
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}: {{ $value | printf \"%.1f\" }}% free"

      - alert: HostOutOfMemory
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Out of memory on {{ $labels.instance }}"

      - alert: HostDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} is down"

      - alert: HostHighLoad
        expr: node_load15 > 4
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High system load on {{ $labels.instance }}"

Alertmanager Configuration

When Alertmanager is included, always generate alertmanager/alertmanager.yml:

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'instance']
  group_wait: 30s          # Wait to batch related alerts before sending
  group_interval: 5m       # How often to resend grouped alerts
  repeat_interval: 4h      # Repeat firing alerts every 4 hours

  receiver: 'default'

  routes:
    - match:
        severity: critical
      receiver: 'critical'
      repeat_interval: 1h   # Critical alerts: repeat every hour

receivers:
  - name: 'default'
    # Replace with your notification method:
    # slack_configs, email_configs, pagerduty_configs
    # See: https://prometheus.io/docs/alerting/configuration/

  - name: 'critical'
    # Same — configure separately for critical escalation

inhibit_rules:
  # Suppress warnings when a critical alert for the same instance is already firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['instance']

Alert fatigue prevention: The inhibit_rules section suppresses warning-level alerts when a critical alert for the same instance is already firing. Without this, a disk-full event can trigger 10+ separate alerts simultaneously.


Output Format

Write all files silently to ./monitoring/ in the current working directory.

Structure:

monitoring/
  docker-compose.yml
  .env.example
  prometheus/
    prometheus.yml
    alert.rules.yml          (if alerting configured)
  grafana/
    provisioning/
      datasources/
        prometheus.yml
      dashboards/
        dashboard.yml
  loki/
    loki-config.yaml         (if Loki included)
  promtail/
    promtail-config.yaml     (if Promtail included)
  alertmanager/
    alertmanager.yml         (if Alertmanager included)

Then print ONLY:

✅ Monitoring stack created in ./monitoring/

▶ To start:
  cd monitoring
  cp .env.example .env && nano .env
  docker compose up -d

▶ Verify everything is running:
  docker compose ps
  docker compose logs -f

🌐 Grafana:      http://localhost:3000  (admin / your-password)
🌐 Prometheus:   http://localhost:9090
🌐 Alertmanager: http://localhost:9093  (if configured)

📊 Import dashboards in Grafana:
   Dashboards → Import → Enter ID: 1860 (Node Exporter Full)
   Dashboards → Import → Enter ID: 893  (Docker monitoring)

💡 Next: /linux-security-hardener to lock down the monitoring ports before exposing to internet

Adjust URLs to match the actual services generated.

Install via CLI
npx skills add https://github.com/chfle/lehnert-claude-skills --skill linux-monitoring-setup
Repository Details
star Stars 1
call_split Forks 0
navigation Branch main
article Path SKILL.md
More from Creator