thruk-monitoring-report


Generate a deterministic HTML monitoring report from Thruk and send it by email. The agent collects pre-aggregated JSON from the read-only `thruk_*` MCP tools — focused on the AVAILABILITY/SLA, ANALYTICS and PROBLEM-INTELLIGENCE families (thruk-mcp v1.8.0) — hands each response verbatim to `scripts/save.sh`, then runs `scripts/render.py` (pure-Python, stdlib only) to format the dumps into a fixed 3-family HTML template, and finally delivers it via the `ews-mcp` `send_email` tool (Exchange Web Services). Use this skill for any Thruk monitoring digest (Windows / Linux / network / …); the perimeter is parameterised by hostgroup and/or custom_var so the same skill serves multiple scopes. Read this skill BEFORE composing a Thruk report by hand.

By k9fr4n. Updated 6/1/2026

name: thruk-monitoring-report
whenToUse: |
  When a task asks for a scheduled or on-demand Thruk monitoring digest (SLA/availability + sliding-window analytics + open problems) restricted to a perimeter (hostgroup and/or custom_vars KERNEL=…) and delivered as an HTML email. Always prefer this skill over composing the HTML inline: the renderer is byte-stable across runs given the same inputs, which is the whole point.

Thruk monitoring report

Deterministic Thruk → HTML email pipeline. The agent only orchestrates MCP calls and shell commands. Aggregation is done server-side by thruk-mcp (v1.8.0), with one exception: a UNION perimeter on the log-based analytics family, which thruk-mcp cannot OR server-side, so save.sh --merge unions the two single-leaf responses client-side (see Perimeter §). HTML formatting is done by a single schema-aware Python renderer, and final delivery goes through the ews-mcp send_email tool. Same inputs ⇒ same report.

This skill is built almost entirely from three read-only tool families:

  • Problem intelligence (current state): which problems are open, unacked, old, or silently broken.
  • Analytics (sliding window over the log): noise, storms, flapping, recurrence, notifications, reliability (MTTR/MTBF).
  • Availability / SLA & performance: uptime % per host/service and metrics about to breach their warn/crit threshold.

Pipeline

LLM ── thruk_* MCP calls ─▶ save.sh ── /tmp/thruk-report/*.json
                                            │
                                            ▼
                                       render.py ── report.html
                                            │
                                            ▼
                              cat report.html → ews-mcp send_email

Every analytics / availability / problem-intelligence tool returns an envelope: {since, until, total_*, results:[...], _warning?}. save.sh stores the response verbatim; render.py unwraps results, shows the envelope scalars (counts, window, group_by) as a one-line caption, and renders _warning(s) as an orange note — nothing is dropped.
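The envelope handling described above can be sketched as follows. This is a minimal illustration, not the code in scripts/render.py: the function name `unwrap_envelope`, the `_warnings` plural key, and the caption separator are assumptions made for the example.

```python
def unwrap_envelope(envelope):
    """Split an envelope into (rows, caption, warnings) without dropping anything."""
    rows = envelope.get("results") or []
    warnings = []
    scalars = []
    # Iterate keys in sorted order so the caption is identical across runs.
    for key in sorted(envelope):
        value = envelope[key]
        if key == "results":
            continue
        if key in ("_warning", "_warnings"):  # "_warnings" is an assumed variant
            # Accept either a single string or a list of strings.
            warnings.extend(value if isinstance(value, list) else [value])
        elif not isinstance(value, (dict, list)):
            # Envelope scalars (counts, window, group_by) go into the caption.
            scalars.append(f"{key}={value}")
    return rows, " · ".join(scalars), warnings
```

Sorting the keys is one simple way to keep the caption stable, which matters for the byte-stability goal of the report.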

Sections of the report (fixed order, 3 families)

One slot file per row in /tmp/thruk-report/. Order and columns are locked by the SECTIONS list in scripts/render.py: if you add, remove, or reorder a slot, update BOTH this table and SECTIONS in the same commit. Missing slots render as "(slot non collecté)" (slot not collected) and empty ones as "✓ (aucun)" (none); neither crashes the renderer.
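A rough sketch of the missing/empty distinction, assuming every slot file holds an envelope-shaped JSON object. The helper names `load_slot` and `slot_placeholder` are hypothetical and do not come from scripts/render.py; only the two placeholder strings are taken from the text above.

```python
import json
from pathlib import Path

def load_slot(path):
    """Return the parsed slot file, or None when the slot was never collected."""
    p = Path(path)
    if not p.exists():
        return None
    return json.loads(p.read_text(encoding="utf-8"))

def slot_placeholder(envelope):
    """Return the placeholder text for a slot, or None when it has rows to render."""
    if envelope is None:
        return "(slot non collecté)"  # slot not collected
    if not envelope.get("results"):
        return "✓ (aucun)"            # collected, but the tool returned no rows
    return None
```

Treating "missing" and "empty" as two separate states is what lets a healthy perimeter be told apart from a skipped MCP call.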

Family A — Problem intelligence (current state)

| Slot file | MCP tool | Key params |
|---|---|---|
| problem_counts.json | thruk_problem_counts | filter |
| unacked_critical.json | thruk_unacked_critical | filter, threshold_minutes=60 |
| oldest_problems.json | thruk_oldest_problems | filter, limit=20 |
| stale_acks.json | thruk_stale_acks | filter, min_days=7 |
| stale_checks.json | thruk_stale_checks | filter |

Family B — Analytics (sliding window)

| Slot file | MCP tool | Key params |
|---|---|---|
| alert_heatmap.json | thruk_alert_heatmap | filter, since, bucket |
| notification_heatmap.json | thruk_notification_heatmap | filter, since, bucket |
| noisy_hosts.json | thruk_top_noisy_hosts | filter, since, limit=20 |
| noisy_services.json | thruk_top_noisy_services | filter, since, limit=20 |
| recurring_problems.json | thruk_recurring_problems | filter, since, min_alerts=5 |
| flap_summary.json | thruk_flap_summary | filter, since, limit=20 |
| concurrent_failures.json | thruk_concurrent_failures | filter, since, min_hosts=3 |
| notification_summary.json | thruk_notification_summary | filter, since, group_by='host' |
| reliability_report.json | thruk_reliability_report | filter, since='-7d', limit=50 |

Family C — Availability / SLA & performance

| Slot file | MCP tool | Key params |
|---|---|---|
| host_availability.json | thruk_hostgroup_availability | hostgroup, type='hosts', timeperiod |
| service_availability.json | thruk_hostgroup_availability | hostgroup, type='services', timeperiod |
| perfdata_near_threshold.json | thruk_perfdata_near_threshold | filter, within_percent=10, limit |

The two availability slots come from the SAME tool with a different type=. host_availability rows carry time_up_percent and service_availability rows carry time_ok_percent; the renderer sorts each worst-first and caps it at 50 rows. Do not collect type='both' into a single slot: it mixes the two column sets, and the 900+ row payload always spills.
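The worst-first ordering and the 50-row cap could look roughly like this. It is a sketch under assumptions: the `host` and `service` row fields used as tie-breakers, the constant name `AVAILABILITY_CAP`, and the choice to sort a missing percentage first are illustrative, not taken from scripts/render.py.

```python
AVAILABILITY_CAP = 50  # cap stated in the text; the constant name is illustrative

def worst_first(rows, percent_key, cap=AVAILABILITY_CAP):
    """Sort rows by ascending availability percentage, then keep the first `cap`."""
    def sort_key(row):
        # Name-based tie-breakers keep the order stable when percentages are equal.
        # A row without the percentage is treated as 0.0 and therefore sorts first.
        return (
            float(row.get(percent_key, 0.0)),
            str(row.get("host", "")),      # assumed field name
            str(row.get("service", "")),   # assumed field name
        )
    return sorted(rows, key=sort_key)[:cap]
```

Call it with `percent_key="time_up_percent"` for host rows and `percent_key="time_ok_percent"` for service rows.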

Perimeter — OR where supported, two-call --merge for analytics

thruk-mcp's OR filter support is partial, and the skill MUST follow it per family. A perimeter that is the UNION of a hostgroup and a custom_var is handled two different ways:

Families A (problem-intelligence) & C-perfdata → single OR call

These tools accept an OR filter tree and de-duplicate server-side, so the union is one MCP call (one save.sh <slot>):

{"type":"group","operator":"or","conditions":[
  {"type":"leaf","field":"hostgroup","op":"eq","value":"HG_WINDOWS"},
  {"type":"leaf","field":"custom_var","op":"eq","value":{"var":"KERNEL","val":"windows"}}
]}

Slots using the OR call: problem_counts, unacked_critical, oldest_problems, stale_acks, stale_checks, perfdata_near_threshold.
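For illustration, the OR tree above can be built from the perimeter fields with a small helper. The function names are hypothetical, and mapping each custom_var to its own OR leg is an assumption; the text only shows the one-hostgroup, one-custom_var case.

```python
def leaf(field, value, op="eq"):
    """Build one leaf condition in the filter-tree shape shown above."""
    return {"type": "leaf", "field": field, "op": op, "value": value}

def perimeter_filter(hostgroup=None, custom_vars=None):
    """Return a single leaf, or an OR group when the perimeter has several legs."""
    legs = []
    if hostgroup:
        legs.append(leaf("hostgroup", hostgroup))
    # Assumption: every custom_var becomes its own leg; sorted for stable output.
    for var, val in sorted((custom_vars or {}).items()):
        legs.append(leaf("custom_var", {"var": var, "val": val}))
    if not legs:
        raise ValueError("perimeter needs a hostgroup and/or a custom_var")
    if len(legs) == 1:
        return legs[0]
    return {"type": "group", "operator": "or", "conditions": legs}
```

The individual legs are also what the Family-B tools take, one leg per call, since those tools reject the OR group.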

Family B (log-based analytics) → two single-leaf calls + --merge

The 9 analytics tools reject an OR on hostgroup/custom_var ("Log/alert/notification filters do not support OR on hostgroup or custom_var — these fields require a secondary /hosts lookup and can only be AND-combined"). Do NOT send an OR to them and do NOT substitute a single leg (that would silently shrink the perimeter).

Instead collect each slot in two calls — one per perimeter leg — and union them client-side with save.sh <slot> --merge:

  1. tool(filter={leaf hostgroup=HG_WINDOWS}) → save.sh <slot> --merge
  2. tool(filter={leaf custom_var KERNEL=windows}) → save.sh <slot> --merge

The first --merge writes verbatim; the second unions into it. The merge is deduplicated (Option A): rows are keyed by (host, service), or by the bucket/group key. A host present in both legs reports the SAME window stat, so per-object counters are kept at MAX (never summed, so nothing is double-counted), while genuinely additive bucket rows (heatmaps, notification summary) ARE summed. The dedup/sum policy per slot lives in scripts/_save_helper.py (MERGE_POLICY).

Single-leaf perimeter (only a hostgroup, or only a custom_var)? Then each Family-B slot is just one call — still use --merge (it writes verbatim when the slot does not yet exist), so the procedure is uniform.
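The merge semantics described in this section can be sketched as below. This is not the code in scripts/_save_helper.py: the function signature, the way the policy is passed, and the example field names (`alerts`, `bucket`, `count`) are assumptions. The sketch also leaves envelope scalars such as total_* untouched.

```python
import copy

def merge_envelopes(existing, incoming, key_fields, counters, policy):
    """Union `incoming` into `existing`; `policy` is "max" (dedup) or "sum" (additive)."""
    if policy not in ("max", "sum"):
        raise ValueError(f"unknown merge policy: {policy!r}")
    if existing is None:
        # First leg, or a single-leaf perimeter: store the response verbatim.
        return copy.deepcopy(incoming)
    combine = max if policy == "max" else (lambda a, b: a + b)
    merged = {}
    # dicts keep insertion order, so the result order is deterministic.
    for row in existing["results"] + incoming["results"]:
        key = tuple(row.get(field) for field in key_fields)
        if key not in merged:
            merged[key] = dict(row)
        else:
            for counter in counters:
                merged[key][counter] = combine(
                    merged[key].get(counter, 0), row.get(counter, 0))
    out = dict(existing)
    out["results"] = list(merged.values())
    return out
```

With "max", a host that sits in both the hostgroup leg and the custom_var leg keeps its single window count; with "sum", the two legs add up per key, as described above for heatmaps and the notification summary.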

Availability exception: thruk_hostgroup_availability takes a hostgroup NAME, not a filter tree — it cannot express the custom_var leg of a union. The SLA tables therefore cover the hostgroup only. If the perimeter has custom_vars but no hostgroup, skip the two availability slots (write nothing) and push a warning into meta.warnings.

report_max_objects guard: the availability backend caps a report at 1000 objects and returns HTTP 500 past it ("too many objects"). For large hostgroups the service_availability (type='services') leg overflows first. If it 500s, skip that slot, keep host_availability, and push a warning into meta.warnings — do NOT retry with a creative workaround. If host_availability spills to a fil_* that export_fil_to_workdir refuses (HTTP 400 "file use case not supported"), treat it as non-collectable: skip the slot with a warning rather than halting the whole report.
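The skip-and-warn behaviour for the availability slots might be organised like this. It is only a sketch: `plan_availability`, `collect_or_skip`, the `fetch` callable standing in for the MCP call, the use of RuntimeError, and the warning wording are all hypothetical; `meta` is the dict passed to save.sh init.

```python
AVAILABILITY_SLOTS = ["host_availability.json", "service_availability.json"]

def plan_availability(meta):
    """Return the availability slots to attempt for this perimeter."""
    if meta.get("hostgroup"):
        return list(AVAILABILITY_SLOTS)
    if meta.get("custom_vars"):
        # custom_vars without a hostgroup: the tool cannot express this perimeter.
        meta.setdefault("warnings", []).append(
            "availability skipped: perimeter has custom_vars but no hostgroup")
    return []

def collect_or_skip(meta, slot, fetch):
    """Run one collection attempt; on failure, record a warning and skip the slot."""
    try:
        return fetch()
    except RuntimeError as exc:  # stand-in for an HTTP 500 / HTTP 400 from the tool
        meta.setdefault("warnings", []).append(f"{slot} skipped: {exc}")
        return None  # no retry and no workaround, as required above
```

Each slot fails on its own, so a service_availability overflow still leaves host_availability in the report.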

Working directory

/tmp/thruk-report/ — scratch directory inside the kdust container. save.sh init wipes its CONTENTS (not the dir) before each run.

If a large MCP response is spilled by the runtime as a Dust fil_* reference, materialise it first with export_fil_to_workdir(file_id, dest_path=/tmp/thruk-report/<slot>.raw.json) then save.sh <slot>.json --from-file /tmp/thruk-report/<slot>.raw.json. Allowed --from-file roots: /tmp/thruk-report, /tmp/kdust-fil-cache, /projects, and relative conversation/* paths.
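A check against the allowed --from-file roots could be sketched as follows. The function name is hypothetical, and the real save.sh may behave differently, for instance by resolving symlinks, which this lexical check does not do.

```python
import posixpath

ALLOWED_ROOTS = ("/tmp/thruk-report", "/tmp/kdust-fil-cache", "/projects")

def is_allowed_source(path):
    """Return True when `path` falls under one of the allowed --from-file roots."""
    # normpath collapses "." and ".." lexically, so "root/../x" cannot slip through.
    norm = posixpath.normpath(path)
    if not posixpath.isabs(norm):
        # Relative paths are accepted only under conversation/.
        return norm.startswith("conversation/")
    # Match the root itself or a path strictly below it; "/projects-evil" is not "/projects".
    return any(norm == root or norm.startswith(root + "/") for root in ALLOWED_ROOTS)
```

Comparing against `root + "/"` rather than the bare prefix avoids accepting sibling directories that merely share a name prefix.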

Procedure for the agent

1. Decide the window (ISO-8601 UTC)

  • Monday → since = now − 72 h, period_label='72h'.
  • Tue–Fri → since = now − 24 h, period_label='24h'.
  • until = now.
  • The analytics tools also accept Thruk relative time (-24h, -72h, -7d) directly in since= — use that; reserve the ISO values for meta + availability timeperiod.
  • For availability, prefer the Thruk-native timeperiod shortcut (last24hours, lastweek) over since/until.
  • reliability_report is best over a longer window (since='-7d').
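The window rules above can be expressed as a small function. This is a sketch: the function name and the returned `relative_since` key are illustrative, and treating Saturday and Sunday like Tue–Fri (24 h) is an assumption, since the rules above only cover Monday to Friday.

```python
from datetime import datetime, timedelta, timezone

ISO_FMT = "%Y-%m-%dT%H:%M:%SZ"

def report_window(now=None):
    """Return since/until (ISO-8601 UTC) and the period label for a run at `now`.

    `now` must be timezone-aware; it defaults to the current UTC time.
    """
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    now = now.replace(microsecond=0)
    # Monday (weekday() == 0) covers the weekend; other days use 24 h.
    # Assumption: weekend runs, which the rules above do not cover, also use 24 h.
    hours = 72 if now.weekday() == 0 else 24
    since = now - timedelta(hours=hours)
    return {
        "since": since.strftime(ISO_FMT),
        "until": now.strftime(ISO_FMT),
        "period_label": f"{hours}h",
        "relative_since": f"-{hours}h",  # Thruk relative form for the analytics tools
    }
```

The ISO values feed meta (and could feed availability if since/until were used instead of the timeperiod shortcut), while `relative_since` is the form passed in since= to the analytics tools.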

2. Initialise the workdir + meta

run_skill_script(
  skill='kdust/thruk-monitoring-report',
  command=['scripts/save.sh', 'init'],
  stdin=JSON.stringify({
    scope_label:  'Windows',             # subject + header
    hostgroup:    'HG_WINDOWS',          # or null
    custom_vars:  { KERNEL: 'windows' }, # or {}
    since:        '<ISO UTC>',
    until:        '<ISO UTC>',
    period_label: '24h',                 # or '72h'
    warnings:     []                     # agent appends caveats here
  })
)

3. Collect each slot

Persist every response with save.sh. Families A and C-perfdata use one OR call per slot; Family B uses two single-leaf calls per slot with --merge; availability uses the hostgroup name.

scripts/save.sh <slot>.json                       # A / C : stdin = MCP response
scripts/save.sh <slot>.json --merge               # B     : per-leg, unions
scripts/save.sh <slot>.json [--merge] --from-file <p>  # large response spilled to <p>

Let <OR> = the OR filter tree, <HG> = {leaf hostgroup=HG_WINDOWS}, <CV> = {leaf custom_var KERNEL=windows}.

Family A — one OR call each (no --merge):

  • thruk_problem_counts(filter=<OR>)
  • thruk_unacked_critical(filter=<OR>, threshold_minutes=60)
  • thruk_oldest_problems(filter=<OR>, limit=20)
  • thruk_stale_acks(filter=<OR>, min_days=7)
  • thruk_stale_checks(filter=<OR>)

Family B — two calls each (<HG> then <CV>), both --merge: (if the perimeter has a single leg, do that one call only — still --merge)

  • thruk_alert_heatmap(filter=<HG|CV>, since='-24h', bucket='1h') → alert_heatmap.json --merge
  • thruk_notification_heatmap(filter=<HG|CV>, since='-24h', bucket='1h') → notification_heatmap.json --merge
  • thruk_top_noisy_hosts(filter=<HG|CV>, since='-24h', limit=20) → noisy_hosts.json --merge
  • thruk_top_noisy_services(filter=<HG|CV>, since='-24h', limit=20) → noisy_services.json --merge
  • thruk_recurring_problems(filter=<HG|CV>, since='-24h', min_alerts=5) → recurring_problems.json --merge
  • thruk_flap_summary(filter=<HG|CV>, since='-24h', limit=20) → flap_summary.json --merge
  • thruk_concurrent_failures(filter=<HG|CV>, since='-24h', min_hosts=3) → concurrent_failures.json --merge
  • thruk_notification_summary(filter=<HG|CV>, since='-24h', group_by='host') → notification_summary.json --merge
  • thruk_reliability_report(filter=<HG|CV>, since='-7d', limit=50) → reliability_report.json --merge

Family C — availability (hostgroup name) + perfdata (one OR call):

  • thruk_hostgroup_availability(hostgroup='HG_WINDOWS', type='hosts', timeperiod='last24hours') → host_availability.json
  • thruk_hostgroup_availability(hostgroup='HG_WINDOWS', type='services', timeperiod='last24hours') → service_availability.json (skip + warn on HTTP 500 report_max_objects)
  • thruk_perfdata_near_threshold(filter=<OR>, within_percent=10, limit=200)

4. Render the HTML

run_skill_script(
  skill='kdust/thruk-monitoring-report',
  command=['python3', 'scripts/render.py']
)

Stdout = a one-line summary with row counts per slot (NA = slot not collected). The HTML lands at /tmp/thruk-report/report.html.

5. Send the email (via ews-mcp)

  1. Read the HTML: run_command(cat /tmp/thruk-report/report.html). Typically 30–200 KB — well under the tool body cap. If it ever overflows, lower MAX_ROWS_PER_SECTION / limit=, never truncate at the agent level.
  2. Build the subject from meta.json: [Monitoring <scope_label>] Rapport <period_label> — <YYYY-MM-DD>
  3. Call send_email:
    send_email(
      to:          ["fsallet@ecritel.net"],
      subject:     "<built above>",
      body:        "<contents of report.html>",
      body_format: "html",
      importance:  "Normal"
    )
    
    Do not set target_mailbox (the bound mailbox is the right From:), do not add a plain-text part, do not attach the HTML as a file — the HTML IS the body. Surface the returned message_id in the run output.
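As an illustration of steps 2 and 3, the subject and the send_email arguments can be assembled from meta.json like this. The helper names are hypothetical. The date is taken as a parameter because the template above does not say which date to use, and the recipient list is passed in rather than hard-coded.

```python
from datetime import date

def build_subject(meta, day):
    """Format: [Monitoring <scope_label>] Rapport <period_label> <dash> <YYYY-MM-DD>."""
    # "\u2014" is the dash character used in the subject template above.
    return (f"[Monitoring {meta['scope_label']}] "
            f"Rapport {meta['period_label']} \u2014 {day.isoformat()}")

def build_send_email_args(meta, html, recipients, day):
    """Assemble the send_email arguments; the HTML is the body, not an attachment."""
    return {
        "to": list(recipients),
        "subject": build_subject(meta, day),
        "body": html,
        "body_format": "html",
        "importance": "Normal",
        # target_mailbox is deliberately absent: the bound mailbox is the sender.
    }
```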

Read-only contract

This skill NEVER calls thruk_acknowledge, thruk_schedule_*, thruk_recheck, thruk_remove_acknowledgement, thruk_delete_*, thruk_checks, thruk_notifications, or any other write tool. Keep it that way — the report task needs read access only.
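One way to enforce the contract in an orchestration wrapper is an allowlist built from the slot tables of this skill. This is a hypothetical guard, not something the skill ships; the names `REPORT_TOOLS` and `assert_report_tool` are invented for the example.

```python
# The 16 read-only tools used by the three families of this report.
REPORT_TOOLS = frozenset({
    # Family A: problem intelligence
    "thruk_problem_counts", "thruk_unacked_critical", "thruk_oldest_problems",
    "thruk_stale_acks", "thruk_stale_checks",
    # Family B: analytics
    "thruk_alert_heatmap", "thruk_notification_heatmap", "thruk_top_noisy_hosts",
    "thruk_top_noisy_services", "thruk_recurring_problems", "thruk_flap_summary",
    "thruk_concurrent_failures", "thruk_notification_summary",
    "thruk_reliability_report",
    # Family C: availability / SLA and performance
    "thruk_hostgroup_availability", "thruk_perfdata_near_threshold",
})

def assert_report_tool(tool_name):
    """Raise PermissionError for any Thruk tool outside the report allowlist."""
    if tool_name not in REPORT_TOOLS:
        raise PermissionError(f"{tool_name} is not part of the read-only report")
    return tool_name
```

An allowlist fails closed: a write tool added to thruk-mcp later is rejected without anyone having to update a denylist.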

Caps & gotchas

  • Analytics tools cap the log scan at 10000 entries (_warning in the envelope, rendered as an orange note). Narrow since= if you hit it.
  • thruk_hostgroup_availability(type='both') returns hosts+services in one shot (900+ rows) and always spills — use two type= calls.
  • save.sh init wipes the workdir contents; saved_to / spill paths from previous runs are invalid afterwards.
  • Determinism: same JSON inputs ⇒ same report.html (only meta.generated_at, defaulted to now, varies — set it explicitly for byte-stable output). If you see drift, the bug is in render.py; open an issue, do not patch the task prompt.

Failure modes

| Symptom | Cause | Fix |
|---|---|---|
| render.py exits with "missing input: meta.json" | save.sh init not run | Run init first |
| Section shows "(slot non collecté)" (slot not collected) | MCP call skipped | Collect that slot (or leave it; it is optional) |
| Section shows "✓ (aucun)" (none) | Tool returned empty results | Nothing to do: healthy perimeter |
| save.sh: stdin is not valid JSON | Streamed a non-JSON blob | Re-collect, or use --from-file for spilled payloads |
| Analytics tool rejects OR (do not support OR on hostgroup or custom_var) | Sent an OR filter to a Family-B log tool | Split into two single-leaf calls (<HG>, <CV>) + save.sh <slot> --merge (Perimeter §) |
| save.sh --merge: ... no 'results' list to merge | Merged a non-envelope / scalar slot | Only Family-B slots take --merge; Family A/C use a single call |
| --from-file: path outside allowed roots | Spill landed elsewhere | Re-export with export_fil_to_workdir to /tmp/thruk-report/ |
| send_email auth / mailbox error | ews-mcp secret rotated / unbound | Re-bind the Secret in /settings/mcp, retry once |

Extending the perimeter (Linux, network, …)

Scope-agnostic by design. To add a report: create a Task whose prompt calls this skill with scope_label, hostgroup and/or custom_vars set for the scope. No skill code change needed. See references/perimeters.md for Ecritel perimeter conventions.

Install via CLI
npx skills add https://github.com/k9fr4n/KDust --skill thruk-monitoring-report