ydb-slo-workload - SKILL.md Agent Skill

name: ydb-slo-workload
description: >
  Help developers build, debug, and configure workloads for YDB SLO Action testing. Use when the user asks to "write an SLO workload", "create a workload for YDB SLO testing", "implement OTLP metrics for SLO", "push metrics to Prometheus from workload", "build Docker image for SLO test", "debug workload metrics not showing up", "configure custom SLO metrics", "set up SLO thresholds", "fix ref label in metrics", or mentions ydb-slo-action, WORKLOAD_REF, sdk_operations_total, sdk_operation_latency. Also use when reviewing or optimizing existing workload code that interacts with the YDB SLO Action infrastructure.

YDB SLO Workload

Guide developers in building workloads that run inside the YDB SLO Action test infrastructure.

A workload is a Docker image that connects to YDB, performs read/write operations, and pushes performance metrics via OTLP. The SLO Action runs two instances (current and baseline) simultaneously under chaos conditions, then compares their metrics.

Workflow

1. Identify the task

| Task | What to do |
|---|---|
| Write new workload | Guide through the contract: env vars, required metrics, OTLP setup, duration handling |
| Debug metrics | Check metric names, labels (especially `ref`), OTLP endpoint config, push interval |
| Configure custom metrics | Help write `metrics_yaml` with correct PromQL queries |
| Configure thresholds | Help write `thresholds_yaml` with appropriate patterns and bounds |
| Review existing workload | Check compliance with the contract, correct label usage, error handling |
| Build Docker image | Guide Dockerfile creation, CMD vs command override, resource awareness |
| Add custom job artifacts | Copy or generate files under `.slo/extra/` in a step after init main; see README "Extra Artifacts" |

2. Load references

| Task | References to load |
|---|---|
| Write new workload | `references/workload-contract.md` |
| Debug metrics | `references/workload-contract.md`, `references/metrics-config.md` |
| Configure custom metrics | `references/metrics-config.md` |
| Configure thresholds | `references/thresholds-config.md` |
| Review existing workload | `references/workload-contract.md` |
| Build Docker image | `references/workload-contract.md` |

3. Key concepts

These points cover the most common mistakes developers make. Keep them in mind for every task:

The `ref` label is mandatory. Every metric must include `ref` set to the value of the `WORKLOAD_REF` environment variable. Without it, the report cannot separate current from baseline data. This is the #1 cause of "metrics not showing up in report".
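As an illustration of the rule above, here is a minimal sketch that reads `WORKLOAD_REF` once at startup and builds the label set for a per-operation metric. The helper names (`workload_ref`, `metric_labels`), the fail-fast behavior, and the example label values are assumptions for this sketch, not part of the contract.

```python
import os


def workload_ref() -> str:
    """Return the value of WORKLOAD_REF, failing fast when it is missing."""
    ref = os.environ.get("WORKLOAD_REF", "").strip()
    if not ref:
        # Assumption: stopping at startup is preferable to pushing metrics
        # that the report cannot attribute to current or baseline.
        raise RuntimeError("WORKLOAD_REF is not set; metrics would lack the ref label")
    return ref


def metric_labels(operation_type: str, operation_status: str, ref: str) -> dict[str, str]:
    """Build the label set shared by the per-operation metrics."""
    return {
        "operation_type": operation_type,
        "operation_status": operation_status,
        "ref": ref,
    }
```

Passing the same `ref` value to every recording call, rather than re-reading the environment in each place, makes it harder for a single metric to lose the label.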

Latency metrics must be pre-computed gauges, not histograms. The latency metrics are `sdk_operation_latency_p50_seconds`, `sdk_operation_latency_p95_seconds`, and `sdk_operation_latency_p99_seconds`: the workload computes the percentiles itself and pushes them as gauges. The SLO Action does not compute percentiles from histograms.
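One way to produce these gauge values is to collect raw latencies for a window and compute percentiles before each push. The sketch below uses the nearest-rank definition, which is only one of several percentile definitions. The function names and the choice of window are assumptions, and only the three metric names come from the contract.

```python
import math


def percentile(sorted_samples: list[float], q: int) -> float:
    """Nearest-rank percentile of an ascending list; q is in 1..100."""
    if not sorted_samples:
        raise ValueError("no samples")
    # Rank is 1-based: the smallest rank whose share of samples reaches q percent.
    rank = math.ceil(q * len(sorted_samples) / 100)
    return sorted_samples[max(rank, 1) - 1]


def latency_gauges(latencies_seconds: list[float]) -> dict[str, float]:
    """Map one window of latency samples to the three gauge values."""
    ordered = sorted(latencies_seconds)
    return {
        "sdk_operation_latency_p50_seconds": percentile(ordered, 50),
        "sdk_operation_latency_p95_seconds": percentile(ordered, 95),
        "sdk_operation_latency_p99_seconds": percentile(ordered, 99),
    }
```

A workload would typically keep one sample window per combination of `operation_type` and `operation_status`, since each combination is a separate series.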

Push interval matters. Metrics must be pushed every second. With a typical 10-minute test duration, longer intervals produce too few data points for meaningful analysis.
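At one push per second, a 10-minute run yields about 600 points per series. The following sketch shows a push loop that schedules ticks against the start time so that a slow push does not delay later ones. `run_push_loop` and its parameters are hypothetical, and wiring `push` to the actual OTLP exporter is outside this sketch.

```python
import time
from typing import Callable


def run_push_loop(
    push: Callable[[], None],
    duration_seconds: float,
    interval_seconds: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call push() once per interval until the duration elapses; return the push count."""
    start = clock()
    pushes = 0
    while True:
        # Each tick is computed from the start time, not from the previous push.
        next_tick = start + pushes * interval_seconds
        if next_tick - start >= duration_seconds:
            return pushes
        delay = next_tick - clock()
        if delay > 0:
            sleep(delay)
        push()
        pushes += 1
```

The `clock` and `sleep` parameters exist so the loop can be tested without waiting in real time.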

`workload_current_command` replaces Docker `CMD`. It does not append to the existing command; the entire command is replaced. Design the workload entrypoint to work both with and without extra arguments.
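A simple way to make an entrypoint tolerate both cases is to give every flag a default, so an empty argument list is valid. The sketch below does this with `argparse`. The flag names `--read-rps` and `--write-rps` and their defaults are hypothetical and not defined by the SLO Action.

```python
import argparse
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse workload flags; every flag has a default so no arguments are required."""
    parser = argparse.ArgumentParser(prog="workload")
    # Hypothetical flags, shown only to illustrate defaulting.
    parser.add_argument("--read-rps", type=int, default=100)
    parser.add_argument("--write-rps", type=int, default=10)
    return parser.parse_args(argv)
```

One possible image layout is an `ENTRYPOINT` that names the program and a `CMD` that holds only default flags. With that layout, replacing `CMD` changes the flags while the same program still runs.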

Chaos is expected. YDB nodes will be killed, paused, and network-partitioned during the test. The workload must handle transient connection errors, retries, and timeouts without crashing. The cluster can be as small as 2 database nodes (`disable_compose_profiles: extra-nodes`), where losing one node removes half the compute. Never pin to a specific node or assume a node count; rely on the `ydb` hostname and SDK discovery.
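To illustrate handling transient failures without crashing, here is a minimal retry wrapper with exponential backoff and jitter. `TransientError` is a hypothetical stand-in for whatever retryable errors the SDK raises, and the attempt and delay limits are arbitrary. If the SDK under test provides its own retry mechanism, that is normally what the workload should exercise instead.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for the SDK's retryable connection or timeout errors."""


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay_seconds: float = 0.05,
    max_delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[T, int]:
    """Run operation, retrying transient failures; return (result, retry_count)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation(), attempt
        except TransientError:
            if attempt == max_attempts - 1:
                # Out of attempts: re-raise so the caller can record a failed
                # operation and keep the workload loop running.
                raise
            # Exponential backoff capped at max_delay_seconds, with full jitter.
            cap = min(max_delay_seconds, base_delay_seconds * 2 ** attempt)
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

The returned retry count is the kind of value a workload could add to `sdk_retry_attempts_total`, and the caller's `except` branch is where an operation with a failure status would be counted.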

Required metrics (exact names):

sdk_operations_total{operation_type, operation_status, ref}
sdk_operation_latency_p50_seconds{operation_type, operation_status, ref}
sdk_operation_latency_p95_seconds{operation_type, operation_status, ref}
sdk_operation_latency_p99_seconds{operation_type, operation_status, ref}
sdk_retry_attempts_total{operation_type, ref}
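When reviewing a workload, the list above can be turned into a small check. The sketch below encodes the required names and label keys exactly as listed and reports missing or empty labels for a single series. `check_series` and its message format are hypothetical, and the check does not cover any custom metrics a workload may also push.

```python
REQUIRED_METRICS: dict[str, frozenset[str]] = {
    "sdk_operations_total": frozenset({"operation_type", "operation_status", "ref"}),
    "sdk_operation_latency_p50_seconds": frozenset({"operation_type", "operation_status", "ref"}),
    "sdk_operation_latency_p95_seconds": frozenset({"operation_type", "operation_status", "ref"}),
    "sdk_operation_latency_p99_seconds": frozenset({"operation_type", "operation_status", "ref"}),
    "sdk_retry_attempts_total": frozenset({"operation_type", "ref"}),
}


def check_series(name: str, labels: dict[str, str]) -> list[str]:
    """Return contract problems for one series; an empty list means it conforms."""
    required = REQUIRED_METRICS.get(name)
    if required is None:
        return [f"{name}: not one of the required metric names"]
    problems = [f"{name}: missing label {label}" for label in sorted(required - set(labels))]
    problems += [
        f"{name}: empty label {label}"
        for label in sorted(required)
        if label in labels and not labels[label]
    ]
    return problems
```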