name: thruk-monitoring-report
description: |
Generate a deterministic HTML monitoring report from Thruk and
send it by email. The agent collects pre-aggregated JSON from the
read-only thruk_* MCP tools — focused on the AVAILABILITY/SLA,
ANALYTICS and PROBLEM-INTELLIGENCE families (thruk-mcp v1.8.0) —
hands each response verbatim to scripts/save.sh, then runs
scripts/render.py (pure-Python, stdlib only) to format the dumps
into a fixed 3-family HTML template, and finally delivers it via
the ews-mcp send_email tool (Exchange Web Services). Use this
skill for any Thruk monitoring digest (Windows / Linux / network /
…); the perimeter is parameterised by hostgroup and/or custom_var
so the same skill serves multiple scopes. Read this skill BEFORE
composing a Thruk report by hand.
whenToUse: |
When a task asks for a scheduled or on-demand Thruk monitoring
digest (SLA/availability + sliding-window analytics + open
problems) restricted to a perimeter (hostgroup and/or custom_vars
KERNEL=…) and delivered as an HTML email. Always prefer this skill
over composing the HTML inline — the renderer is byte-stable across
runs given the same inputs, which is the whole point.
Thruk monitoring report
Deterministic Thruk → HTML email pipeline. The agent only orchestrates
MCP calls and shell commands; aggregation is done by thruk-mcp
server-side (v1.8.0) — the lone exception is a UNION perimeter on the
log-based analytics family, which thruk-mcp cannot OR server-side,
so save.sh --merge unions the two single-leaf responses (see
Perimeter §). HTML formatting is done by a single schema-aware Python
renderer. Final delivery is the ews-mcp send_email tool. Same
inputs ⇒ same report.
This skill is built almost entirely from three read-only tool families:
- Problem intelligence (current state): which problems are open, unacked, old, or silently broken.
- Analytics (sliding window over the log): noise, storms, flapping, recurrence, notifications, reliability (MTTR/MTBF).
- Availability / SLA & performance: uptime % per host/service and metrics about to breach their warn/crit threshold.
Pipeline
LLM ── thruk_* MCP calls ─▶ save.sh ── /tmp/thruk-report/*.json
                                              │
                                              ▼
                                     render.py ── report.html
                                              │
                                              ▼
                          cat report.html → ews-mcp send_email
Every analytics / availability / problem-intelligence tool returns an
envelope: {since, until, total_*, results:[...], _warning?}.
save.sh stores the response verbatim; render.py unwraps
results, shows the envelope scalars (counts, window, group_by) as a
one-line caption, and renders _warning(s) as an orange note — nothing
is dropped.
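The unwrapping described above can be sketched as follows. This is a minimal illustration, not the actual code of scripts/render.py: the function name `unwrap_envelope`, the caption separator, and the acceptance of a plural `_warnings` list are assumptions made for the example.

```python
from typing import Any


def unwrap_envelope(payload: dict[str, Any]) -> tuple[list[Any], str, list[str]]:
    """Split an envelope into (rows, one-line caption, warnings)."""
    rows = payload.get("results") or []

    # Accept a single `_warning` string or a `_warnings` list (assumption).
    warnings: list[str] = []
    for key in ("_warning", "_warnings"):
        value = payload.get(key)
        if isinstance(value, str):
            warnings.append(value)
        elif isinstance(value, list):
            warnings.extend(str(v) for v in value)

    # Every remaining scalar lands in the caption, in sorted key order,
    # so the same input always yields the same caption.
    scalars = {
        k: v
        for k, v in payload.items()
        if k != "results"
        and not k.startswith("_")
        and isinstance(v, (str, int, float, bool))
    }
    caption = " · ".join(f"{k}={scalars[k]}" for k in sorted(scalars))
    return rows, caption, warnings
```

Sorting the scalar keys is one simple way to keep the caption stable regardless of the key order in the stored JSON.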
Sections of the report (fixed order, 3 families)
One slot file per row in /tmp/thruk-report/. Order + columns are
locked by the SECTIONS list in scripts/render.py — if you add /
remove / reorder a slot, update BOTH this table and SECTIONS in the
same commit. Missing slots render as "(slot non collecté)", empty
ones as "✓ (aucun)" — never a crash.
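The "never a crash" rule for missing and empty slots could look like the sketch below. It is an assumption-laden illustration rather than the real renderer: `slot_status` and the `"collected"` return value are hypothetical, and only the two placeholder strings come from the text above.

```python
import json
from pathlib import Path

NOT_COLLECTED = "(slot non collecté)"  # placeholder for a missing slot file
EMPTY = "✓ (aucun)"                    # placeholder for an empty result set


def slot_status(workdir: Path, slot: str) -> str:
    """Classify a slot file as missing, empty, or collected."""
    path = workdir / slot
    if not path.is_file():
        return NOT_COLLECTED
    payload = json.loads(path.read_text(encoding="utf-8"))
    # Envelope slots carry their rows under `results`; anything else
    # (for example a scalar summary) is treated as the content itself.
    if isinstance(payload, dict) and "results" in payload:
        content = payload["results"]
    else:
        content = payload
    return "collected" if content else EMPTY
```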
Family A — Problem intelligence (current state)
| Slot file | MCP tool | Key params |
|---|---|---|
| problem_counts.json | thruk_problem_counts | filter |
| unacked_critical.json | thruk_unacked_critical | filter, threshold_minutes=60 |
| oldest_problems.json | thruk_oldest_problems | filter, limit=20 |
| stale_acks.json | thruk_stale_acks | filter, min_days=7 |
| stale_checks.json | thruk_stale_checks | filter |
Family B — Analytics (sliding window)
| Slot file | MCP tool | Key params |
|---|---|---|
| alert_heatmap.json | thruk_alert_heatmap | filter, since, bucket |
| notification_heatmap.json | thruk_notification_heatmap | filter, since, bucket |
| noisy_hosts.json | thruk_top_noisy_hosts | filter, since, limit=20 |
| noisy_services.json | thruk_top_noisy_services | filter, since, limit=20 |
| recurring_problems.json | thruk_recurring_problems | filter, since, min_alerts=5 |
| flap_summary.json | thruk_flap_summary | filter, since, limit=20 |
| concurrent_failures.json | thruk_concurrent_failures | filter, since, min_hosts=3 |
| notification_summary.json | thruk_notification_summary | filter, since, group_by='host' |
| reliability_report.json | thruk_reliability_report | filter, since='-7d', limit=50 |
Family C — Availability / SLA & performance
| Slot file | MCP tool | Key params |
|---|---|---|
| host_availability.json | thruk_hostgroup_availability | hostgroup, type='hosts', timeperiod |
| service_availability.json | thruk_hostgroup_availability | hostgroup, type='services', timeperiod |
| perfdata_near_threshold.json | thruk_perfdata_near_threshold | filter, within_percent=10, limit |
The two availability slots come from the SAME tool with a different
type=. host_availability rows carry time_up_percent, service_availability
rows carry time_ok_percent; the renderer sorts each worst-first and caps
it at 50 rows. Do not collect type='both' into one slot: it mixes the two
column sets, and the 900+ row payload always spills.
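The worst-first ordering and 50-row cap can be sketched like this. It is an illustration only: `worst_first` is a hypothetical helper, and the `host` / `service` tie-break fields are assumptions about the row shape, added so that equal percentages still sort deterministically.

```python
def worst_first(rows, percent_key, cap=50):
    """Return at most `cap` rows, lowest availability first."""
    return sorted(
        rows,
        key=lambda r: (
            float(r[percent_key]),   # lowest uptime / OK time first
            r.get("host", ""),       # assumed field, used as tie-break
            r.get("service", ""),    # assumed field, used as tie-break
        ),
    )[:cap]


hosts = [
    {"host": "b", "time_up_percent": 99.5},
    {"host": "a", "time_up_percent": 92.0},
    {"host": "c", "time_up_percent": 100.0},
]
worst_hosts = worst_first(hosts, "time_up_percent")
```

The same helper would be called with `"time_ok_percent"` for the service slot, which is why the two slots should never be mixed in one file.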
Perimeter — OR where supported, two-call --merge for analytics
thruk-mcp's OR filter support is partial, and the skill MUST
follow it per family. A perimeter that is the UNION of a hostgroup and
a custom_var is handled two different ways:
Families A (problem-intelligence) & C-perfdata → single OR call
These tools accept an OR filter tree and de-duplicate server-side, so
the union is one MCP call (one save.sh <slot>):
{"type":"group","operator":"or","conditions":[
{"type":"leaf","field":"hostgroup","op":"eq","value":"HG_WINDOWS"},
{"type":"leaf","field":"custom_var","op":"eq","value":{"var":"KERNEL","val":"windows"}}
]}
Slots using the OR call: problem_counts, unacked_critical,
oldest_problems, stale_acks, stale_checks,
perfdata_near_threshold.
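For illustration, the OR filter tree shown above can be built from the same hostgroup / custom_vars values that go into meta. The helper names (`leaf`, `perimeter_legs`, `or_filter`) are hypothetical, and OR-ing several custom_vars together is an assumption; the text only shows a single custom_var leg.

```python
def leaf(field, value):
    """One leaf condition in the filter tree."""
    return {"type": "leaf", "field": field, "op": "eq", "value": value}


def perimeter_legs(hostgroup, custom_vars):
    """One leaf per perimeter leg: the hostgroup first, then each custom_var."""
    legs = []
    if hostgroup:
        legs.append(leaf("hostgroup", hostgroup))
    for var in sorted(custom_vars or {}):
        legs.append(leaf("custom_var", {"var": var, "val": custom_vars[var]}))
    return legs


def or_filter(hostgroup, custom_vars):
    """OR tree for Family A / perfdata; a bare leaf when there is one leg."""
    legs = perimeter_legs(hostgroup, custom_vars)
    if not legs:
        raise ValueError("perimeter needs a hostgroup and/or a custom_var")
    if len(legs) == 1:
        return legs[0]
    return {"type": "group", "operator": "or", "conditions": legs}
```

`perimeter_legs` also yields the single-leaf filters that the log-based analytics tools need, one call per leg.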
Family B (log-based analytics) → two single-leaf calls + --merge
The 9 analytics tools reject an OR on hostgroup/custom_var
("Log/alert/notification filters do not support OR on hostgroup or
custom_var — these fields require a secondary /hosts lookup and can
only be AND-combined"). Do NOT send an OR to them and do NOT
substitute a single leg (that would silently shrink the perimeter).
Instead collect each slot in two calls — one per perimeter leg —
and union them client-side with save.sh <slot> --merge:
1. tool(filter={leaf hostgroup=HG_WINDOWS}) → save.sh <slot> --merge
2. tool(filter={leaf custom_var KERNEL=windows}) → save.sh <slot> --merge
The first --merge writes verbatim; the second unions into it. The
merge is deduplicated (Option A): rows are keyed by (host, service) (or the bucket/group key); a host present in both legs
reports the SAME window stat, so per-object counters are kept at MAX
(never summed → no double-count), while genuinely additive bucket rows
(heatmaps, notification summary) ARE summed. The dedup/sum policy per
slot lives in scripts/_save_helper.py (MERGE_POLICY).
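The MAX-versus-SUM distinction can be illustrated with a small sketch. This is not the content of scripts/_save_helper.py: `merge_rows`, its parameters, and the `alerts` / `count` / `bucket` field names are hypothetical, chosen only to show the two policies side by side.

```python
def merge_rows(first, second, key_fields, counter_fields, mode):
    """Union two result lists; mode is 'max' (per-object) or 'sum' (buckets)."""
    if mode not in ("max", "sum"):
        raise ValueError(f"unknown merge mode: {mode!r}")
    combine = max if mode == "max" else (lambda a, b: a + b)

    merged = {}
    for row in list(first) + list(second):
        key = tuple(row.get(f) for f in key_fields)
        if key not in merged:
            merged[key] = dict(row)  # first sighting: keep the row as-is
            continue
        for field in counter_fields:
            merged[key][field] = combine(
                merged[key].get(field, 0), row.get(field, 0)
            )

    # Sort by key so the merged file does not depend on call order.
    def sort_key(key):
        return tuple("" if part is None else str(part) for part in key)

    return [merged[k] for k in sorted(merged, key=sort_key)]
```

With `mode="max"`, a host seen in both legs keeps its single window stat; with `mode="sum"`, rows sharing a bucket key are added together.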
Single-leaf perimeter (only a hostgroup, or only a custom_var)? Then
each Family-B slot is just one call — still use --merge (it
writes verbatim when the slot does not yet exist), so the procedure is
uniform.
Availability exception: thruk_hostgroup_availability takes a hostgroup NAME, not a filter tree, so it cannot express the custom_var leg of a union. The SLA tables therefore cover the hostgroup only. If the perimeter has custom_vars but no hostgroup, skip the two availability slots (write nothing) and push a warning into meta.warnings.
report_max_objects guard: the availability backend caps a report at 1000 objects and returns HTTP 500 past it ("too many objects"). For large hostgroups the service_availability (type='services') leg overflows first. If it 500s, skip that slot, keep host_availability, and push a warning into meta.warnings; do NOT retry with a creative workaround. If host_availability spills to a fil_* that export_fil_to_workdir refuses (HTTP 400 "file use case not supported"), treat it as non-collectable: skip the slot with a warning rather than halting the whole report.
Working directory
/tmp/thruk-report/ — scratch directory inside the kdust container.
save.sh init wipes its CONTENTS (not the dir) before each run.
If a large MCP response is spilled by the runtime as a Dust fil_*
reference, materialise it first with
export_fil_to_workdir(file_id, dest_path=/tmp/thruk-report/<slot>.raw.json)
then save.sh <slot>.json --from-file /tmp/thruk-report/<slot>.raw.json.
Allowed --from-file roots: /tmp/thruk-report, /tmp/kdust-fil-cache,
/projects, and relative conversation/* paths.
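A path check along these lines would enforce the allowed roots. It is a sketch under assumptions, not the actual save.sh logic: `is_allowed_source` is a hypothetical name, paths containing `..` are rejected outright, and symlinks are not resolved.

```python
from pathlib import PurePosixPath

ALLOWED_ROOTS = ("/tmp/thruk-report", "/tmp/kdust-fil-cache", "/projects")


def is_allowed_source(path: str) -> bool:
    """True when `path` sits under an allowed root or under conversation/."""
    p = PurePosixPath(path)
    if ".." in p.parts:
        return False  # refuse traversal instead of trying to normalise it
    if not p.is_absolute():
        # Relative paths are only accepted below conversation/.
        return len(p.parts) > 1 and p.parts[0] == "conversation"
    # is_relative_to compares whole path components (Python 3.9+), so
    # /tmp/thruk-report-evil does not match /tmp/thruk-report.
    return any(p.is_relative_to(root) for root in ALLOWED_ROOTS)
```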
Procedure for the agent
1. Decide the window (ISO-8601 UTC)
- Monday → since = now − 72 h, period_label='72h'.
- Tue–Fri → since = now − 24 h, period_label='24h'.
- until = now.
- The analytics tools also accept Thruk relative time (-24h, -72h, -7d) directly in since=; use that, and reserve the ISO values for meta + availability timeperiod.
- For availability, prefer the Thruk-native timeperiod shortcut (last24hours, lastweek) over since/until.
- reliability_report is best over a longer window (since='-7d').
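The window rules above can be sketched as a small helper. Assumptions, all made for the example: the weekday is evaluated in UTC, weekend runs fall back to the 24 h window (the list only covers Monday to Friday), and the name `report_window` and the `relative_since` key are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def report_window(now: datetime) -> dict:
    """Compute since/until (ISO-8601 UTC) and the period label.

    `now` must be timezone-aware; it is converted to UTC first.
    """
    now = now.astimezone(timezone.utc).replace(microsecond=0)
    # Monday covers the weekend (72 h); every other day uses 24 h.
    hours = 72 if now.weekday() == 0 else 24
    since = now - timedelta(hours=hours)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return {
        "since": since.strftime(fmt),
        "until": now.strftime(fmt),
        "period_label": f"{hours}h",
        "relative_since": f"-{hours}h",  # form accepted by the analytics tools
    }


monday = report_window(datetime(2024, 1, 8, 6, 0, tzinfo=timezone.utc))
```

The ISO values feed meta, while `relative_since` is what would be passed as since= to the analytics calls.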
2. Initialise the workdir + meta
run_skill_script(
skill='kdust/thruk-monitoring-report',
command=['scripts/save.sh', 'init'],
stdin=JSON.stringify({
scope_label: 'Windows', # subject + header
hostgroup: 'HG_WINDOWS', # or null
custom_vars: { KERNEL: 'windows' }, # or {}
since: '<ISO UTC>',
until: '<ISO UTC>',
period_label: '24h', # or '72h'
warnings: [] # agent appends caveats here
})
)
3. Collect each slot
Persist every response with save.sh. Families A and C-perfdata use
one OR call per slot; Family B uses two single-leaf calls per
slot with --merge; availability uses the hostgroup name.
scripts/save.sh <slot>.json # A / C : stdin = MCP response
scripts/save.sh <slot>.json --merge # B : per-leg, unions
scripts/save.sh <slot>.json [--merge] --from-file <p> # large response spilled to <p>
Let <OR> = the OR filter tree, <HG> = {leaf hostgroup=HG_WINDOWS},
<CV> = {leaf custom_var KERNEL=windows}.
Family A — one OR call each (no --merge):
- thruk_problem_counts(filter=<OR>)
- thruk_unacked_critical(filter=<OR>, threshold_minutes=60)
- thruk_oldest_problems(filter=<OR>, limit=20)
- thruk_stale_acks(filter=<OR>, min_days=7)
- thruk_stale_checks(filter=<OR>)
Family B — two calls each (<HG> then <CV>), both --merge:
(if the perimeter has a single leg, do that one call only — still --merge)
- thruk_alert_heatmap(filter=<HG|CV>, since='-24h', bucket='1h') → alert_heatmap.json --merge
- thruk_notification_heatmap(filter=<HG|CV>, since='-24h', bucket='1h') → notification_heatmap.json --merge
- thruk_top_noisy_hosts(filter=<HG|CV>, since='-24h', limit=20) → noisy_hosts.json --merge
- thruk_top_noisy_services(filter=<HG|CV>, since='-24h', limit=20) → noisy_services.json --merge
- thruk_recurring_problems(filter=<HG|CV>, since='-24h', min_alerts=5) → recurring_problems.json --merge
- thruk_flap_summary(filter=<HG|CV>, since='-24h', limit=20) → flap_summary.json --merge
- thruk_concurrent_failures(filter=<HG|CV>, since='-24h', min_hosts=3) → concurrent_failures.json --merge
- thruk_notification_summary(filter=<HG|CV>, since='-24h', group_by='host') → notification_summary.json --merge
- thruk_reliability_report(filter=<HG|CV>, since='-7d', limit=50) → reliability_report.json --merge
Family C — availability (hostgroup name) + perfdata (one OR call):
- thruk_hostgroup_availability(hostgroup='HG_WINDOWS', type='hosts', timeperiod='last24hours') → host_availability.json
- thruk_hostgroup_availability(hostgroup='HG_WINDOWS', type='services', timeperiod='last24hours') → service_availability.json (skip + warn on HTTP 500 report_max_objects)
- thruk_perfdata_near_threshold(filter=<OR>, within_percent=10, limit=200)
4. Render the HTML
run_skill_script(
skill='kdust/thruk-monitoring-report',
command=['python3', 'scripts/render.py']
)
Stdout = a one-line summary with row counts per slot (NA = slot not
collected). The HTML lands at /tmp/thruk-report/report.html.
5. Send the email (via ews-mcp)
- Read the HTML: run_command(cat /tmp/thruk-report/report.html). Typically 30–200 KB, well under the tool body cap. If it ever overflows, lower MAX_ROWS_PER_SECTION / limit=; never truncate at the agent level.
- Build the subject from meta.json: [Monitoring <scope_label>] Rapport <period_label> — <YYYY-MM-DD>
- Call send_email:
      send_email(
        to: ["fsallet@ecritel.net"],
        subject: "<built above>",
        body: "<contents of report.html>",
        body_format: "html",
        importance: "Normal"
      )
  Do not set target_mailbox (the bound mailbox is the right From:), do not add a plain-text part, and do not attach the HTML as a file: the HTML IS the body. Surface the returned message_id in the run output.
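As an illustration, the subject line can be assembled mechanically from the meta fields. The helper name `build_subject` is hypothetical, and taking the date from the first ten characters of `until` is an assumption; the text only specifies the `<YYYY-MM-DD>` shape.

```python
def build_subject(meta: dict) -> str:
    """Format: [Monitoring <scope_label>] Rapport <period_label> — <YYYY-MM-DD>."""
    day = meta["until"][:10]  # assumption: the report date is the window end
    return f"[Monitoring {meta['scope_label']}] Rapport {meta['period_label']} — {day}"


subject = build_subject({
    "scope_label": "Windows",
    "period_label": "24h",
    "until": "2024-01-09T06:00:00Z",
})
```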
Read-only contract
This skill NEVER calls thruk_acknowledge, thruk_schedule_*,
thruk_recheck, thruk_remove_acknowledgement, thruk_delete_*,
thruk_checks, thruk_notifications, or any other write tool. Keep
it that way — the report task needs read access only.
Caps & gotchas
- Analytics tools cap the log scan at 10000 entries (_warning in the envelope, rendered as an orange note). Narrow since= if you hit it.
- thruk_hostgroup_availability(type='both') returns hosts+services in one shot (900+ rows) and always spills; use two type= calls.
- save.sh init wipes the workdir contents; saved_to / spill paths from previous runs are invalid afterwards.
- Determinism: same JSON inputs ⇒ same report.html (only meta.generated_at, defaulted to now, varies; set it explicitly for byte-stable output). If you see drift, the bug is in render.py; open an issue, do not patch the task prompt.
Failure modes
| Symptom | Cause | Fix |
|---|---|---|
| render.py exits missing input: meta.json | save.sh init not run | Run init first |
| Section shows "(slot non collecté)" | MCP call skipped | Collect that slot (or leave it; it is optional) |
| Section shows "✓ (aucun)" | Tool returned an empty results list | Nothing to do: healthy perimeter |
| save.sh: stdin is not valid JSON | Streamed a non-JSON blob | Re-collect, or use --from-file for spilled payloads |
| Analytics tool rejects OR (do not support OR on hostgroup or custom_var) | Sent an OR filter to a Family-B log tool | Split into two single-leaf calls (<HG>, <CV>) + save.sh <slot> --merge (Perimeter §) |
| save.sh --merge: ... no 'results' list to merge | Merged a non-envelope / scalar slot | Only Family-B slots take --merge; Family A/C use a single call |
| --from-file: path outside allowed roots | Spill landed elsewhere | Re-export with export_fil_to_workdir to /tmp/thruk-report/ |
| send_email auth / mailbox error | ews-mcp secret rotated / unbound | Re-bind the Secret in /settings/mcp, retry once |
Extending the perimeter (Linux, network, …)
Scope-agnostic by design. To add a report: create a Task whose prompt
calls this skill with scope_label, hostgroup and/or custom_vars
set for the scope. No skill code change needed. See
references/perimeters.md for Ecritel perimeter conventions.