name: dashboards-as-code
description: |
Use this skill when building, modifying, reviewing, or pushing Grafana dashboards under packages/grafana-dashboards/ (Materialize observability dashboards generated from Python via grafana-foundation-sdk and py-mzmon-lib). Also use it when writing panel descriptions for those dashboards, picking palettes, or working through Materialize-specific PromQL patterns (cluster/replica filtering, peek latency, source/sink metrics, label-family quirks).
Dashboards as Code
This skill is the entry point for the Materialize dashboards-as-code project. Stable conventions live in the repo docsite under docs/content/reference/internal/dashboard/ — this file is intentionally slim and links into the docsite at heading-level granularity. The non-link content below is the state snapshot: what currently exists, what's in flight, and what's queued for cleanup.
Audience reminder
The dashboards themselves target Materialize end users: database-literate operators with basic graph-reading fluency but minimal cloud / Kubernetes / observability expertise. SQL is fair game; jargon like "differential dataflow's arrangement" needs a one-liner explanation. Panel descriptions, titles, and cluster names should respect that baseline.
The docsite reference pages target repo contributors (SRE, Field Engineering, CloudOps, Database Engineers) and AI agents reading this skill.
Where to find what
| Looking for… | Read |
|---|---|
| Grafana target versions, Dashboard v1/v2 schema state, SDK choices | SDKs and Schemas |
Code structure, UID conventions, push process, gcx dashboards update vs ad-hoc v2 API |
Generating and Pushing Dashboards |
| Palettes, layouts, panel visualization, panel description voice, PromQL conventions, label families, metric quirks, PromQL recipes, module-level constants table | Style Guidelines |
| Testing conventions (currently sparse) | Testing |
Frequently needed deep links into the Style Guidelines:
- Tab-level theming
- Multi-select variables in regex contexts
- Sparkline stats
- Partitioned sparkline stats
- Writing panel descriptions
- Filtering by cluster / replica
- Materialize metric label families
- Known metric quirks and gotchas
- PromQL recipes
- Shared module-level constants and helpers
And into Generating:
- PUT body shape — required Kubernetes-style envelope when pushing v2 dashboards via
grafana_api_request - Service account permissions — decoding 403s
Schema reference files
When uncertain about the exact shape Grafana expects, the cog-generated openapi schemas are bundled here:
references/dashboard.openapi.json— v1references/dashboardv2beta1.openapi.json— v2beta1references/dashboardv2.openapi.json— v2
All three generated from cog 61ff0a6055fa48f0c7b105fe4a37af637191314f (April 9, 2026).
Current Dashboard State
This section captures the live state of the dashboards in this repo so the next session has something concrete to start from. Update it when state changes meaningfully (new dashboard, new tab, retired panel, theme reassignment).
Dashboard inventory
| Family | Dashboard module | Class | Live UID |
|---|---|---|---|
mz_environment |
overview.overview_dashboard |
EnvironmentOverviewDashboard |
(auto-assigned at first upload; codified UID is mz-mon-env-top, but the live one diverged before that became authoritative — see UID selection and behavior) |
The mz_environment/overview dashboard has six tabs, in declared order:
| # | Tab title | Module | Theme |
|---|---|---|---|
| 1 | Summary | summary.py |
(no unique theme; uses health palette and themes from imports) |
| 2 | Kubernetes Workloads | k8s_resources.py |
K8S_THEME = palette.THEME_PALETTE[0] (blue) |
| 3 | Cluster Objects / Replicas | cluster_objects.py |
CLUSTERS_THEME = palette.THEME_PALETTE[2] (teal) |
| 4 | Connections / Activity | connections_activity.py |
CONNECTIONS_THEME = palette.THEME_PALETTE[1] (cyan) |
| 5 | Compute Objects | compute_objects.py |
COMPUTE_THEME = palette.THEME_PALETTE[3] (orange) |
| 6 | Storage Objects | storage_objects.py |
STORAGE_THEME = palette.THEME_PALETTE[4] (yellow) |
The Summary tab re-uses the KubeResourcesMixin's cpu_total_panel and memory_totals_panel, and also mirrors add_currently_hydrating_panel(...) from compute_objects.py in its Environment Health row.
Tab-by-tab row structure
Summary
- Environment Health — Environment Status, Availability, Last Restart, Currently Hydrating (mirror), Max Lag (Select Time Range), Current CPU Usage, Current Memory Usage
- Environment Info — Materialize Version, Total CPU Capacity, Total Memory
Kubernetes Workloads
- Resources Summary — Total CPU Capacity, Total Memory (includes monitoring)
- Workload Readiness — Pod Readiness, StatefulSet Readiness, Deployment Readiness
- Pod Metrics — Pod CPU Usage, Pod Memory Usage
- Pod Networking — Rx, Tx, Errors, Packet Drops
Cluster Objects / Replicas
- Cluster Summary — Cluster Count, Replica Count
- Replication / Availability — Replica Sizes (donut). (Replica AZs intentionally unwired —
materialize_cloud_availability_zoneis cloud-only AND AZ semantics confuse the target audience; the_az_distribution_panelmethod is kept but not added to the row.) - Cluster Information — Cluster Information table
Connections / Activity
- Connection Summary — Active Sessions, Active Queries, Adapter Command Rate
- Queries — Distribution donut, Query Rate, Peek Latency p50/p90/p99 (3 separate panels)
- Adapter Commands — Adapter Commands by Application table
Compute Objects
- Compute Objects Summary — Active MV, Active Indexes, Active Views, Active Subscribes (donut), Index Types (donut)
- Freshness — Frontier Lag by Cluster, Most-Lagged Collections (both from
mz_dataflow_wallclock_lag_seconds, sentinel-filtered) - Hydration — Currently Hydrating, Hydration Queue Size, Slowest Hydrating Collections (top-15 horizontal bar)
- Dataflows — Dataflow Count, Dataflow Count (per worker), Dataflow Elapsed Rate (log scale)
- Arrangements — Arrangement Rate, Arrangement Rate (per worker), 3 record-count tables (System / User / Transient)
Storage Objects
- Storage Objects Summary — Active Sources, Active Sinks, Active Tables
- Sources — Source Types donut, Sources catalog table, Source Bytes Received (rate), Source Ingestion by Replica (
mz_source_messages_receivedper replica — divergence detector), Source Upstream Errors (commit-failure rate +offset_committed > offset_knowndisconnect indicator, threshold-colored) - Sinks — Sink Types donut, Sink Throughput, Sink Lag (staged minus committed)
- Iceberg Sinks (collapsed by default) — Commit Latency p50/p90/p99, Commit Failures & Conflicts, File & Snapshot Rate
- Kafka Sinks (collapsed by default) — TX Error Rate, Output Buffer, Connect / Disconnect Rate
Known stubs and orphans
compute_objects.pyFreshness row — filled withmz_dataflow_wallclock_lag_seconds(per-cluster max frontier lag + topk laggiest collections). Note the u64::MAX (~1.8e19) sentinel for collections with no established frontier — filtered with< 1e9; the metric is a summary withquantile0/1only (take1for worst-case).dataflows.py— orphaned after Dataflows became a row inside Compute Objects rather than its own tab. Safe to delete; only referenced fromoverview_dashboard.py's import history (now removed).
Self-managed metric migration (done)
The dashboard was migrated off the cloud-only v2_mz_* family and materialize_cloud_organization_id onto self-managed mz_* metrics + materialize_cloud_organization_name filtering (see Deployment target). Also fixed: metrics_datasource() no longer pins a dev datasource name ($metricsDatasource now resolves to the instance default), which was silently breaking every query.
These panels have no self-managed metric and are intentionally kept with a TODO(self-managed) + no_value (they render blank/0 until a metric exists, rather than being deleted):
- Compute Objects: Slowest Hydrating Collections (per-collection hydration time —
v2_mz_compute_hydration_time_secondsis cloud-only, no self-managed equivalent confirmed with the team for this release; description points atmz_internal.mz_compute_hydration_timesSQL). (Currently Hydrating was since revived via the wallclock-lag sentinel — see below; Active Indexes and Index Types were wired tomz_indexes_count.) - Cluster Objects: Replica Availability Zones —
materialize_cloud_availability_zoneis cloud-only AND AZ semantics confuse the end-user audience, so it was intentionally unwired (method kept, not added to the row). Don't re-add without product sign-off.
Currently Hydrating = wallclock-lag sentinel count (no status metric exists): there is no source/sink/object status or hydration-state metric on self-managed. But a collection with no established output frontier reports the mz_dataflow_wallclock_lag_seconds u64::MAX sentinel (> 1e15), so count(... > 1e15) (with instance_id!="") is a real-time hydration-queue proxy: it spikes briefly whenever a replica restarts (dataflows re-hydrating) and drains back to 0 — that's the signal we wanted, NOT "stuck." A count that stays elevated is the genuinely-broken case (e.g. pg_src2, status created, never hydrated — it sits persistently at 1). This backs the revived Currently Hydrating stat (Summary mirror + Compute -> Hydration row); a neutral sparkline, deliberately NOT alarm-colored, since brief spikes are normal. Metrics expose only collection_id; the description hands off to mz_internal.mz_hydration_statuses WHERE NOT hydrated / mz_source_statuses / the console Objects view for names. (An earlier separate red "Stuck Objects" stat was removed — same query, but alarm-on-any false-fired on every routine restart.)
Complementary failure-mode signals now exist (none is a status metric — that's SQL-only):
- Currently Hydrating (Summary + Compute -> Hydration) — wallclock-lag sentinel count; brief spike on replica restart = normal (re)hydration, sustained non-zero = a collection that never got a frontier (created/failed-to-start, e.g.
pg_src2). - Frontier Lag (Compute -> Freshness) — hydrated but falling behind.
- Source Upstream Errors (Storage -> Sources) and the Kafka/Iceberg sink error panels — two source signals on one panel: commit-failure rate (
mz_source_offset_commit_failures— upstream reachable but rejects the commit) AND a disconnected 0/1 indicator (offset_committed > offset_known— broker/DB unreachable sooffset_knowncollapsed; theBrokerTransportFailurestall). The latter is essential: commit-failures does NOT fire for an unreachable broker (the source never reaches the commit step), which surprised us mid-testing — a fully broker-down Kafka source satstalledwith commit-failures flat at 0, and only the offset-disconnect signal (plus frontier lag) caught it. - Source Ingestion by Replica (Storage -> Sources,
mz_source_messages_receivedper replica) — a silent per-replica stall: a restarted replica that can't resume pulling reads 0 while siblings ingest, but the source staysRunningand aggregates (and commit-failures = 0) hide it. This was a real gap —sum by (source_id)aggregate panels mask per-replica failures; the per-replica split (like the per-worker dataflow panel) is the only metric-side place it shows. Pairs with climbing Frontier Lag.
The Storage / "Sources and Sinks" tab was later rebuilt against live sources/sinks (real RDS/MSK upstreams on cluster ingest):
- Active Sources/Sinks, Source Types, Sink Types, and the Sources catalog table now use
mz_storage_objects— the progress-free catalog metric (count(group by (id) (...))). This fixes themz_sources_count/mz_sinks_countprogress-subsource double-count (3 PG sources →type="postgres"=6). - Sources by Status → renamed Sources: there is no source/sink status metric on self-managed, so it's a catalog table (id/type/connection/envelope/cluster); live status is SQL-only (
mz_internal.mz_source_statuses). - Throughput/lag/Iceberg/Kafka sink panels:
_COMPUTE_FILTER(long-formcluster_environmentd_*ids) verified against livemz_source_bytes_received/mz_sink_bytes_committed. Caveat: the$mzClusterListpicker lists compute clusters only, so a storage-only ingest cluster isn't selectable — default "All" shows everything.
Cloud/self-managed convergence (SQL metric prefix): SQL-derived metrics differ only by prefix between envs (mz_X self-managed / v2_mz_X cloud). The prefix is now baked in at generation time from GLOBAL_DASHBOARD_CONFIG.sql_metric_prefix (default mz_), interpolated as variables.SQL_METRIC_PREFIX in query f-strings — e.g. f"{variables.SQL_METRIC_PREFIX}compute_cluster_status". This replaced the old $sqlMetricPrefix Grafana variable (auto-detected via …compute_cluster_status), which GMP can't run since it can't do query_result(...) detection. Only prefix SQL-derived metrics (catalog *_count, compute_cluster_status, storage_objects, object_id, workload_clusters, arrangement-introspection, dataflow_elapsed, compute_hydration_time_seconds); genuine instrumentation (arrangement_maintenance, source_*/sink_* throughput, peek_duration, query_total, wallclock_lag, …) is bare mz_ in both envs and must NOT be prefixed (would become a nonexistent v2_mz_… in cloud). Table excludeByName must list both resolved names. A v2_mz_ variant is planned (config-driven, separate build — SQL_METRIC_PREFIX is captured at import, so one process emits one prefix; threading a build context to emit both is under consideration). See style guide → Converging cloud and self-managed.
Filter fragments are inlined Python, not ConstantVariables: the $environmentFilter / $containerFilter / $clusterFilter / $replicaFilter hidden ConstantVariables were removed — Grafana's constant-variable interpolation mangled their nested $…List refs and embedded commas. Now variables.ENVIRONMENT_FILTER (a constant) and variables.container_filter(*extra) (→ CONTAINER_FILTER in k8s_resources.py) are interpolated directly into query f-strings. See style guide → Intermediates.
Duplicate-job dedup: this instance runs 4 Prometheus jobs against the same clusterd :6878 endpoint, so mz_source_* / mz_sink_* / mz_arrangement_* / mz_compute_replica_history_* each appear under multiple job values and a plain sum(rate(...)) reads 4×. Fixed by wrapping the inner rate/gauge in max without (job) (...) on the affected sum-rate panels (storage source/sink throughput/lag/Iceberg/Kafka, compute arrangement maintenance rate, dataflow elapsed). max by(...) and histogram_quantile panels are already job-invariant. Do not exclude job names by pattern — several metrics (mz_compute_cluster_status, mz_storage_objects, mz_dataflow_elapsed_seconds_total, the *_count metrics) live only on a "legacy" job here, so an exclusion list blanks real panels. The real root cause is the overlapping scrape config (helm/Prometheus) — fixing it there makes the max without (job) wraps no-ops. See Known metric quirks.
Datasource scrape interval (empty rate() panels): Prometheus here scrapes every 60s, but the Grafana datasource is provisioned (via terraform/helm) without jsonData.timeInterval, so it defaults to 15s and $__rate_interval collapses to ~1m — a single sample, so every rate()/increase() panel renders blank despite live data. Fix is jsonData.timeInterval: "60s" on the datasource (matches the real scrape_interval); see Rate intervals. In flight as of this writing — the shared terraform/helm datasource block may still be unpatched, so a freshly-provisioned stack will show empty rate panels until timeInterval is set. Quick check on any instance: count_over_time(<metric>[1m]) returning 1 means the window is too short.
Local push: gcx context local-mzmon → http://localhost:13000. Render+merge+push helper lives at /tmp/mzmon_push.sh (renders the module, carries the live resourceVersion + folder annotation forward, gcx dashboards update). The Grafana MCP is also wired to the same local instance for query verification.
Reference environments
Materialize developers may have access to an internal shared Grafana with multiple test environments. It can be useful to look at queries in live environments when building dashboards. Do not use environments without explicit permission.
When testing against a cloud shared env, scope queries to one environment so you don't mix data across tenants. The dashboards target self-managed Materialize, where the scoping label is materialize_cloud_organization_name="..." (cloud's hex materialize_cloud_organization_id does not exist on self-managed, and neither does the v2_mz_* metric family). Always verify which labels/metrics actually exist on the instance you're querying with list_prometheus_label_names / list_prometheus_metric_names before assuming — see Deployment target: self-managed vs cloud.
Cleanup / refactor candidates
Tracked items that are working but could be tidier:
ENV_SCOPED_NOTEis duplicated incompute_objects.pyandstorage_objects.py. Lift tovisualization.py(or a sibling_messages.pyif it grows)._COMPUTE_FILTERand_ARRANGEMENT_FILTERare the same string in two modules. Lift to a shared place; rename to something neutral like_LONGFORM_CLUSTER_FILTER.dataflows.pyis orphaned. Safe torm.- Hydration is SQL-only on self-managed. No Prometheus metric exposes per-collection hydration state/time (
v2_mz_compute_hydration_time_secondsis cloud-only;mz_compute_controller_hydration_queue_sizeis just the controller queue and reads 0 even with many objects mid-hydration). The two Hydration panels stay backed by the cloud metric (blank here) with descriptions pointing atmz_internal.mz_hydration_statuses/mz_compute_hydration_times(SQL); the live metric-side proxy is the Freshness row (wallclock_lag). mz-mon-prefix isn't enforced inMzDashboard.UIDvalues today (the class hasUID = "env-top"andMzDashboard.__init__prefixes it). Consistent across all current dashboards (one). Worth a validator if more dashboards land.