name: robustmq-metrics
description: Designs and implements minimal, high-value metrics for RobustMQ services and dashboards. Use when the user asks to add metrics, improve observability, or update Grafana panels for core processing pipelines.
RobustMQ Metrics
Purpose
Add observability with minimal but complete coverage of a target pipeline:
- process count and process duration
- retry and terminal outcomes (if applicable)
- key failure points in read/process/write path
- liveness/health
Do not over-instrument. Prefer compact metrics that support operations and debugging.
Metric Design Rules
Cover the chain, not everything:
- success/failure + duration
- retry/terminal path (when strategy exists)
- read/commit/write failures
- up/down
Always include count and latency for the processing path.
Label policy:
- required: stable low-cardinality dimensions only
- optional only when bounded enum (for example result, strategy, protocol)
- service/entity name labels are optional and must pass cardinality review
- forbidden high-cardinality labels: topic, error_message, payload-derived labels
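One way to keep a label bounded is to back it with a Rust enum instead of a free-form string, so the set of values is closed at compile time. The sketch below is illustrative only: ProcessResult, as_label, and check_label_names are hypothetical names, not part of the RobustMQ codebase, and the variants are examples rather than a required set.

```rust
/// Hypothetical bounded label for the `result` dimension.
/// The variants are illustrative; use whatever outcomes the module really has.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProcessResult {
    Success,
    Failure,
    Discard,
    Dlq,
}

impl ProcessResult {
    /// Every value this label can take. Cardinality is fixed at four.
    pub const ALL: [ProcessResult; 4] = [
        ProcessResult::Success,
        ProcessResult::Failure,
        ProcessResult::Discard,
        ProcessResult::Dlq,
    ];

    /// Stable string used as the label value.
    pub fn as_label(self) -> &'static str {
        match self {
            ProcessResult::Success => "success",
            ProcessResult::Failure => "failure",
            ProcessResult::Discard => "discard",
            ProcessResult::Dlq => "dlq",
        }
    }
}

/// Label names the policy forbids by name. Payload-derived labels cannot be
/// detected by name and still need human review.
const FORBIDDEN_LABELS: [&str; 2] = ["topic", "error_message"];

/// Reject a metric definition that uses a forbidden label name.
pub fn check_label_names(names: &[&str]) -> Result<(), String> {
    for name in names {
        if FORBIDDEN_LABELS.contains(name) {
            return Err(format!("label `{}` is high-cardinality and forbidden", name));
        }
    }
    Ok(())
}
```

Helper APIs that accept ProcessResult rather than &str make it hard for a call site to pass an error message or topic name as a label value by accident.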
Naming:
- use module prefix (for example mqtt_, raft_, handler_)
- counters end with _total
- duration histogram uses _ms
- liveness gauge uses _up
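The naming rules are mechanical enough to check in a unit test. The following is a minimal sketch under stated assumptions: MetricKind, MODULE_PREFIXES, and check_metric_name are hypothetical helpers, and the prefix list contains only the three examples given above, not an authoritative list of RobustMQ modules.

```rust
/// Hypothetical classification of the metric kinds the naming rules mention.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MetricKind {
    Counter,
    DurationHistogram,
    LivenessGauge,
}

/// Example prefixes only; extend with the real module list.
const MODULE_PREFIXES: [&str; 3] = ["mqtt_", "raft_", "handler_"];

/// Check a metric name against the prefix and suffix conventions.
pub fn check_metric_name(name: &str, kind: MetricKind) -> Result<(), String> {
    if !MODULE_PREFIXES.iter().any(|p| name.starts_with(*p)) {
        return Err(format!("`{}` lacks a known module prefix", name));
    }
    let suffix = match kind {
        MetricKind::Counter => "_total",
        MetricKind::DurationHistogram => "_ms",
        MetricKind::LivenessGauge => "_up",
    };
    if !name.ends_with(suffix) {
        return Err(format!("`{}` should end with `{}`", name, suffix));
    }
    Ok(())
}
```

A test that runs this check over every registered metric name would catch a missing suffix before it reaches a dashboard query.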
Minimal Metric Set Template
Adapt this template to the target module:
- <module>_messages_processed_success_total{...}
- <module>_messages_processed_failure_total{...}
- <module>_process_duration_ms{...} (histogram)
- <module>_retry_total{...,strategy} (if retry exists)
- <module>_terminal_total{...,result} (discard/dlq/drop etc., if applicable)
- <module>_critical_step_failure_total{...} (read/write/commit/etc.)
- <module>_up{...} (gauge)
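As a rough illustration of adapting the template, the sketch below expands it into concrete metric names for one module and leaves out the retry and terminal metrics when the module has no such path. minimal_metric_names is a hypothetical helper written for this example; it produces names only and registers nothing.

```rust
/// Expand the minimal metric set template for one module.
/// `module` is the prefix without the trailing underscore, for example "mqtt".
pub fn minimal_metric_names(module: &str, has_retry: bool, has_terminal: bool) -> Vec<String> {
    let mut names = vec![
        format!("{}_messages_processed_success_total", module),
        format!("{}_messages_processed_failure_total", module),
        format!("{}_process_duration_ms", module),
    ];
    // Retry and terminal metrics apply only when the module has that strategy.
    if has_retry {
        names.push(format!("{}_retry_total", module));
    }
    if has_terminal {
        names.push(format!("{}_terminal_total", module));
    }
    names.push(format!("{}_critical_step_failure_total", module));
    names.push(format!("{}_up", module));
    names
}
```

For a module with retries but no terminal strategy, minimal_metric_names("mqtt", true, false) yields six names, which is a reasonable upper bound for a first iteration.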
Implementation Workflow
Define metrics in src/common/metrics
- register counters/histogram/gauge
- add record helper APIs
- keep API signatures consistent with chosen low-cardinality labels
Insert runtime instrumentation
- add metrics at success, failure, retry, terminal, and liveness points
Fix call sites impacted by signature changes
- update all metric helper users
- ensure compile passes across dependent crates
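The record-helper and instrumentation steps can be sketched as a small wrapper that times a unit of work and records success or failure in one place. This is a std-only stand-in using atomics: ProcessMetrics, record, and instrumented are hypothetical names, and a real implementation would go through whatever metrics library src/common/metrics already uses, with a proper bucketed histogram instead of the sum and count kept here.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

/// Hypothetical in-memory stand-in for a module's processing metrics.
#[derive(Default)]
pub struct ProcessMetrics {
    pub success_total: AtomicU64,
    pub failure_total: AtomicU64,
    /// Sum and count approximate a histogram for this sketch only.
    pub duration_ms_sum: AtomicU64,
    pub duration_ms_count: AtomicU64,
}

impl ProcessMetrics {
    /// Record helper: one call updates both the outcome counter and the latency.
    pub fn record(&self, ok: bool, elapsed_ms: u64) {
        if ok {
            self.success_total.fetch_add(1, Ordering::Relaxed);
        } else {
            self.failure_total.fetch_add(1, Ordering::Relaxed);
        }
        self.duration_ms_sum.fetch_add(elapsed_ms, Ordering::Relaxed);
        self.duration_ms_count.fetch_add(1, Ordering::Relaxed);
    }
}

/// Run `work`, measure its wall-clock duration, and record the outcome.
/// The result is returned unchanged so call sites keep their error handling.
pub fn instrumented<T, E>(
    metrics: &ProcessMetrics,
    work: impl FnOnce() -> Result<T, E>,
) -> Result<T, E> {
    let start = Instant::now();
    let outcome = work();
    let elapsed_ms = start.elapsed().as_millis() as u64;
    metrics.record(outcome.is_ok(), elapsed_ms);
    outcome
}
```

Wrapping the processing call in a helper like this keeps count and latency paired on every path, and it limits the number of call sites that change when the helper's signature does.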
Validation
- run targeted cargo check for impacted crates
- run lints for touched files
Grafana Decision Rule
After metrics are added, decide whether to update grafana/robustmq-broker.json:
- Update dashboard if metrics are operationally critical and stable.
- Skip the dashboard update only when the user explicitly says not to update it or the metrics are temporary.
When updating dashboard:
- add/extend compact row
- prioritize low-cardinality dimensions and trend panels
- avoid panel explosion; 4-6 panels for first iteration
Output Format
Return:
- Metrics added/changed (names + labels)
- Instrumentation points (files/functions)
- Whether Grafana was updated and why
- Validation commands and results
Guardrails
- Do not add metrics without clear operational use.
- Do not introduce high-cardinality labels.
- Do not duplicate semantically equivalent metrics.
- Do not break existing metric names unless migration is requested.