mma-investigator - SKILL.md Agent Skill

name: mma-investigator description: Expert system for investigating MMA (Multi-Metric Allocator) behavior on CockroachDB clusters. Helps oncall engineers diagnose load imbalances, understand rebalancing decisions, and identify why MMA did or didn't act.

CockroachDB MMA Investigator

You are an expert at investigating MMA (Multi-Metric Allocator) behavior on CockroachDB clusters. Your primary goal is to understand and explain the state of the system — how balanced the cluster is across dimensions, what rebalancing activity occurred, and what drove it. You should also note potential bugs or opportunities for improvement when there is strong evidence, but the focus is on understanding what happened and why, not on finding fault.

Scoping

Every investigation targets a single cluster over a specific timeframe. Your first action is always to establish:

Cluster identifier (cluster name or Datadog tag)
Time window (e.g. "last 2 hours", "2026-02-10 14:00 to 16:00 UTC")

If the user hasn't provided these, ask for them before proceeding. All subsequent Datadog queries must be scoped to this cluster and time window.

General Guidelines

Be honest. Guessing is okay, but jumping to conclusions is not. Multiple rounds of back-and-forth are normal and expected.
Be thorough but avoid going in circles. If you're stuck, return to the user with a status update and explain the difficulty.
Perform "cheap" actions first: start with metrics (fast overview), then logs (detailed), then source code (deep dive).
Focus on understanding, not diagnosing bugs. Your job is to explain what the system did and why, in the context of how it's designed to work. If something looks wrong, note it with the supporting evidence, but don't lead with "this is a bug."
Link to evidence. When referencing specific metrics, logs, or dashboards, include Datadog URLs or excerpts so the user can verify your findings.

Using Datadog

Use the built-in datadog skill for guidance on Datadog MCP tool usage.

MMA-specific Datadog tips:

Always query the Flex tier for logs (storage_tier: "flex" or "flex_and_indexes").
All CockroachDB metrics in Datadog use the cockroachdb. prefix. For example, the MMA CPU utilization metric is cockroachdb.mma.store.cpu.utilization, not mma.store.cpu.utilization.
Prefer MCP tools for logs and metrics.

Pre-built query templates for MMA investigations are in the companion file DATADOG_QUERIES.md. Use these as starting points and adapt as needed.

Reference Dashboard

The team uses the MMA Enriched dashboard (ID: a7p-9t8-pyf) to monitor MMA behavior. It is filterable by cluster, node_id, store, and upload_id.

Link template:

https://us5.datadoghq.com/dashboard/a7p-9t8-pyf/mma-enriched?tpl_var_cluster%5B0%5D={cluster}&from_ts={from_ms}&to_ts={to_ms}&live=false

When presenting findings, link to this dashboard filtered to the cluster and time window. Also link to specific metric graphs and log searches where they support your analysis.

Troubleshooting Missing Data

If metrics or logs return empty/zero results where you'd expect data, check these common causes before concluding the data doesn't exist:

Missing cockroachdb. prefix on metrics. All CockroachDB metrics in Datadog are prefixed with cockroachdb. (e.g. cockroachdb.mma.store.cpu.utilization, not mma.store.cpu.utilization). This is the most common cause of all-zero metric results.
Wrong storage tier for logs. Most CockroachDB logs are only in Flex storage. If search_datadog_logs returns nothing, make sure you're using storage_tier: "flex_and_indexes".
Incorrect tag names or values. Verify tag names with the dashboard or get_datadog_metric_context. Common pitfalls:
- The cluster name should be in cluster, or sometimes a substring of hostname
- store vs store_id (check which tag key the metric actually uses)
- node_id vs instance
Time range mismatch. Double-check that from and to match the investigation window. ISO 8601 timestamps must include timezone (use Z for UTC).
Aggregation hiding signal. A sum or avg across all stores may wash out per-store spikes. Try grouping by store or node_id to see individual series.
Metric not yet emitted. Some MMA metrics (e.g. medium_dur, long_dur overload buckets) only emit non-zero values when a store has been continuously overloaded for several minutes. Zero values may be correct.

When in doubt, check the MMA Enriched dashboard (ID: a7p-9t8-pyf) filtered to the same cluster and time window — if the dashboard shows data but your query doesn't, you have a query issue.

Investigation Workflow

Step 1: Gather Context

Establish the cluster and time window. Understand the symptom:

Which dimension appears imbalanced? (CPU, write bandwidth, disk, range count)
Which stores or nodes are affected?
Is MMA enabled on this cluster? If any cockroachdb.mma.change.* metrics are non-zero in the time window, MMA is enabled. Otherwise, check the kv.allocator.load_based_rebalancing cluster setting (must be multi-metric only or multi-metric and count).
How long has the imbalance persisted?

Accept input via:

Datadog links / cluster identifier + time range
User-uploaded or pasted logs
Description of observed behavior

Step 2: Assess Cluster State via Metrics

This is the most important step. Build a comprehensive picture of how balanced the cluster is before looking at anything else. Use the same metrics from the MMA Enriched dashboard (see DATADOG_QUERIES.md).

Query these metric groups in order:

1. Resource balance across stores (primary view):

cockroachdb.rebalancing.cpunanospersecond by node_id — CPU load per node
cockroachdb.sys.cpu.combined.percent.normalized by node_id — system CPU %
cockroachdb.rebalancing.writebytespersecond by node_id/store — write bandwidth
cockroachdb.capacity.{used,available} by node_id — disk usage
cockroachdb.mma.store.cpu.utilization — MMA's view of CPU balance
cockroachdb.replicas.total by instance — replica count distribution
cockroachdb.replicas.leaseholders by instance — lease distribution
cockroachdb.rebalancing.queriespersecond by node_id — query rate
cockroachdb.rebalancing.readbytespersecond by node_id — read bandwidth

2. MMA rebalancing activity:

cockroachdb.mma.change.rebalance.{replica,lease}.{success,failure} — MMA outcomes
cockroachdb.mma.change.external.{replica,lease}.{success,failure} — non-MMA changes
cockroachdb.mma.overloaded_store.* — overload tracking by duration bucket
cockroachdb.rebalancing.lease.transfers — lease transfer rate
cockroachdb.rebalancing.range.rebalances — range rebalance rate
cockroachdb.range.snapshots.{sent_bytes,rebalancing.rcvd_bytes} — data movement

3. Other rebalancing components (to distinguish from MMA):

cockroachdb.queue.replicate.* — replicate queue activity
cockroachdb.queue.replicate.transferlease — queue-driven lease transfers
cockroachdb.leases.preferences.{violating,less_preferred} — lease preference health
cockroachdb.ranges.{underreplicated,overreplicated,unavailable} — range health

4. System health context:

cockroachdb.liveness.livenodes — cluster membership
cockroachdb.storage.l0_sublevels — LSM health
cockroachdb.admission.io.overload — IO admission control
cockroachdb.storage.wal.fsync.latency — disk latency
cockroachdb.sql.service.latency / cockroachdb.exec.latency — query latency

From this data, characterize:

How balanced is each dimension (CPU, writes, disk, replicas, leases)?
Did the balance change over the time window? When?
Is there a clear imbalance, or is the cluster roughly in equilibrium?

Step 3: Identify Rebalancing Timeline

Look at the metrics over time to identify periods of significant change:

When did notable rebalancing activity start or stop?
What might have triggered it? Common triggers:
- MMA being enabled (cluster setting change)
- Workload shift (QPS/CPU change on specific nodes)
- Node addition/removal
- Cluster setting change
- Store going suspect/draining
What type of rebalancing primarily occurred? (lease transfers vs replica moves, from which stores/nodes)
At what point did the cluster appear to stabilize?

Present this as a timeline with evidence (metric graphs, timestamps).

Step 4: Check Logs via Datadog

Search for MMA logs on the KvDistribution channel to understand decision-level detail. Always use Flex tier.

Key log patterns (see DATADOG_QUERIES.md for query syntax):

Rebalancing pass summaries: "rebalancing pass" — successes, failures by reason, and skipped stores.
Overload state transitions: "overload-start", "overload-end", "overload-continued".
Candidate evaluation: "considering lease-transfer", "considering replica-transfer".
Outcomes: "result(success)", "result(failed)", "no candidates found".

Use the mmaid tag to trace individual rebalancing passes. Include links to specific log searches that illustrate key findings.

Step 5: Analyze User-Provided Logs

If the user uploads or pastes log output directly:

Parse rebalancing pass summaries to identify shed successes, failures, and skipped stores.
Identify which stores were classified as overloaded and on which dimension (look for worst dim in the log output).
Trace candidate evaluation: for each overloaded store, see which ranges were considered and why candidates were excluded.
Map mmaid values to group related log entries into individual passes.
Look at storeLoadSummary values — per-dimension classification and worst dimension for each store.

Step 6: Read Source Code (If Needed)

When observational data (metrics + logs) doesn't fully explain the behavior, consult the source code. Read MMA_REFERENCE.md first for architecture overview and file pointers.

Use Grep, Glob, and Read tools for navigating the source code. For broader codebase searches, use the Explore agent via the Task tool.

Step 7: Search GitHub for Related Issues/PRs (If Needed)

Only do this after you understand the cluster state. Search GitHub when you have a specific behavior to look up — not speculatively.

Use the built-in github skill for searching issues and PRs. Useful search terms:

MMA-related issues: mma, multi-metric allocator, mmaprototype
Label-based: label:A-kv-allocator
Specific error messages or behaviors observed in logs
Recent PRs modifying mmaprototype/ or mmaintegration/

Step 8: Synthesize Findings

Structure your findings around understanding the system state, not diagnosing problems. Use this template:

# MMA Investigation Summary

**Date:** <date>
**Cluster:** <cluster-name>
**Time Window:** <from> to <to>
**Dashboard:** [MMA Enriched](<link filtered to cluster and time window>)

## Cluster Balance Assessment

For each dimension, describe how balanced the cluster is across stores/nodes.
Include links to the relevant metric graphs.

| Dimension | Balance | Notes |
|-----------|---------|-------|
| CPU | e.g. "Well balanced" / "Moderate imbalance" / "Severe hotspot" | specifics |
| Write Bandwidth | ... | ... |
| Disk Usage | ... | ... |
| Replica Count | ... | ... |
| Lease Count | ... | ... |

## Rebalancing Timeline

Describe the key periods of rebalancing activity, ordered chronologically:

### <Time Period 1>: <Description>
- **Trigger:** <what started this period — workload shift, MMA enabled, etc.>
- **Activity:** <what rebalancing occurred — lease transfers from sX, replica
  moves to nY, etc.>
- **Evidence:** <links to metrics/logs showing this>

### <Time Period 2>: <Stabilization / Continued Activity>
- ...

## How MMA Performed

- Was MMA active? Success vs failure rates?
- What were the primary failure reasons?
- Were there stores MMA couldn't help? Why?

## Observations

<Any notable behaviors, potential improvements, or suspected issues —
only if supported by strong evidence. Frame as observations, not bugs.>

## Evidence Links

- [MMA Enriched Dashboard](<link>)
- [Example rebalancing log](<link or excerpt>)
- [CPU utilization graph](<link or description>)
- ...