slo-sli - SKILL.md Agent Skill

name: slo-sli
description: SLO/SLI definitions, Prometheus recording rules, error budget calculations, and burn rate alerting for service reliability management.
allowed-tools:
  - Bash
  - curl

SLO/SLI Definitions and Error Budget Management

Overview

This skill provides templates for implementing Service Level Objectives (SLOs) and Service Level Indicators (SLIs) with Prometheus: recording rules, error budget calculations, and burn rate alerting. The alerts are adapted from the multi-window, multi-burn-rate pattern described in the Google SRE Workbook.

All Prometheus queries use the HTTP API at http://localhost:9090/api/v1/query. The local Prometheus is served over plain HTTP and needs no credentials. Recording rules and alerting rules are YAML rule groups; save them in a rule file and list that file under rule_files in the Prometheus configuration at docker-compose/prometheus/prometheus.yml.

Connection Defaults

Variable	Default	Description
`OPENSEARCH_ENDPOINT`	`https://localhost:9200`	OpenSearch base URL
`OPENSEARCH_USER`	`admin`	OpenSearch username
`OPENSEARCH_PASSWORD`	`My_password_123!@#`	OpenSearch password
`PROMETHEUS_ENDPOINT`	`http://localhost:9090`	Prometheus base URL

SLI Definition Templates

Availability SLI

The availability SLI measures the ratio of successful requests (non-5xx) to total requests. A value of 1.0 means all requests succeeded; 0.99 means 1% failed.

Note on status code labels: The label name varies by OTel SDK version. Older semconv uses http_status_code; newer stable semconv uses http_response_status_code. Use the Metric Discovery section in the metrics skill to check which label is present, and replace http_response_status_code in the queries below with the variant active in your stack.

curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_server_duration_seconds_count{http_response_status_code!~"5.."}[5m])) / sum(rate(http_server_duration_seconds_count[5m]))'

Per-service availability:

curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_server_duration_seconds_count{http_response_status_code!~"5.."}[5m])) by (service_name) / sum(rate(http_server_duration_seconds_count[5m])) by (service_name)'

Latency SLI

The latency SLI measures the ratio of requests completing within a threshold (e.g., 250ms) to total requests. A value of 0.95 means 95% of requests finished within the threshold.

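Note on latency thresholds: The le bucket boundary depends on the metric's unit. For _seconds metrics, use le="0.25" for 250ms. For _milliseconds metrics, use le="250". The le value must also match a bucket boundary that the histogram actually exports; a threshold that falls between boundaries matches no series and the query returns no data. Use the Metric Discovery section in the metrics skill to check which metric name and boundaries are active.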

curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_server_duration_seconds_bucket{le="0.25"}[5m])) / sum(rate(http_server_duration_seconds_count[5m]))'

Per-service latency SLI with a 500ms threshold:

curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_server_duration_seconds_bucket{le="0.5"}[5m])) by (service_name) / sum(rate(http_server_duration_seconds_count[5m])) by (service_name)'

GenAI-Specific SLI

The GenAI SLI measures agent response time objectives using the gen_ai_client_operation_duration_seconds histogram. For example, the ratio of GenAI operations completing within 5 seconds:

curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=sum(rate(gen_ai_client_operation_duration_seconds_bucket{le="5.0"}[5m])) by (gen_ai_operation_name) / sum(rate(gen_ai_client_operation_duration_seconds_count[5m])) by (gen_ai_operation_name)'

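Per-model GenAI availability (non-error operations). The gen_ai_operation_name label holds the kind of operation (for example, chat), not its outcome, so it cannot be used to filter out failures. The query below assumes failed operations carry an error_type label, which is how the OTel error.type attribute usually appears in Prometheus; confirm the label exists in your stack before relying on it:

curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=sum(rate(gen_ai_client_operation_duration_seconds_count{error_type=""}[5m])) by (gen_ai_request_model) / sum(rate(gen_ai_client_operation_duration_seconds_count[5m])) by (gen_ai_request_model)'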

Prometheus Recording Rules

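Recording rules pre-compute SLI values over several time windows so that SLO compliance queries stay fast. Save these rule groups in a separate rule file (for example, slo-rules.yml) and list that file under rule_files in docker-compose/prometheus/prometheus.yml; the groups themselves do not go inside prometheus.yml. If you combine several of the blocks below into one file, keep a single top-level groups: key and append each group to its list.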

Recording Rule Naming Convention

Recording rules follow the pattern:

Pattern	Example
`sli:http_availability:ratio_rate<window>`	`sli:http_availability:ratio_rate5m`
`sli:http_latency:ratio_rate<window>`	`sli:http_latency:ratio_rate5m`

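Windows: 5m, 30m, 1h, 6h, 1d, 3d, 30d. The 30d window is only meaningful if Prometheus retains at least 30 days of data; the default retention (--storage.tsdb.retention.time) is 15 days.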
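Because the per-window rules differ only in the range selector, they can be generated rather than written by hand. The following is a minimal sketch under that assumption, not part of the skill itself: the function name gen_availability_rules is hypothetical, and the metric and label names are the ones used elsewhere in this document.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one availability recording rule per window.
# Metric and label names are copied from this document; adjust to your stack.
gen_availability_rules() {
  local window
  echo "groups:"
  echo "  - name: sli_availability"
  echo "    rules:"
  for window in 5m 30m 1h 6h 1d 3d 30d; do
    cat <<EOF
      - record: sli:http_availability:ratio_rate${window}
        expr: |
          sum(rate(http_server_duration_seconds_count{http_response_status_code!~"5.."}[${window}])) by (service_name)
          /
          sum(rate(http_server_duration_seconds_count[${window}])) by (service_name)
        labels:
          sli: availability
EOF
  done
}
```

Redirecting the output (gen_availability_rules > slo-rules.yml) yields a rule file with one record per window; the latency rules could be generated the same way by swapping the numerator for the _bucket series.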

Availability Recording Rules

groups:
  - name: sli_availability
    rules:
      - record: sli:http_availability:ratio_rate5m
        expr: |
          sum(rate(http_server_duration_seconds_count{http_response_status_code!~"5.."}[5m])) by (service_name)
          /
          sum(rate(http_server_duration_seconds_count[5m])) by (service_name)
        labels:
          sli: availability

      - record: sli:http_availability:ratio_rate30m
        expr: |
          sum(rate(http_server_duration_seconds_count{http_response_status_code!~"5.."}[30m])) by (service_name)
          /
          sum(rate(http_server_duration_seconds_count[30m])) by (service_name)
        labels:
          sli: availability

      - record: sli:http_availability:ratio_rate1h
        expr: |
          sum(rate(http_server_duration_seconds_count{http_response_status_code!~"5.."}[1h])) by (service_name)
          /
          sum(rate(http_server_duration_seconds_count[1h])) by (service_name)
        labels:
          sli: availability

      - record: sli:http_availability:ratio_rate6h
        expr: |
          sum(rate(http_server_duration_seconds_count{http_response_status_code!~"5.."}[6h])) by (service_name)
          /
          sum(rate(http_server_duration_seconds_count[6h])) by (service_name)
        labels:
          sli: availability

      - record: sli:http_availability:ratio_rate1d
        expr: |
          sum(rate(http_server_duration_seconds_count{http_response_status_code!~"5.."}[1d])) by (service_name)
          /
          sum(rate(http_server_duration_seconds_count[1d])) by (service_name)
        labels:
          sli: availability

      - record: sli:http_availability:ratio_rate3d
        expr: |
          sum(rate(http_server_duration_seconds_count{http_response_status_code!~"5.."}[3d])) by (service_name)
          /
          sum(rate(http_server_duration_seconds_count[3d])) by (service_name)
        labels:
          sli: availability

      - record: sli:http_availability:ratio_rate30d
        expr: |
          sum(rate(http_server_duration_seconds_count{http_response_status_code!~"5.."}[30d])) by (service_name)
          /
          sum(rate(http_server_duration_seconds_count[30d])) by (service_name)
        labels:
          sli: availability

Latency Recording Rules

groups:
  - name: sli_latency
    rules:
      - record: sli:http_latency:ratio_rate5m
        expr: |
          sum(rate(http_server_duration_seconds_bucket{le="0.25"}[5m])) by (service_name)
          /
          sum(rate(http_server_duration_seconds_count[5m])) by (service_name)
        labels:
          sli: latency

      - record: sli:http_latency:ratio_rate30m
        expr: |
          sum(rate(http_server_duration_seconds_bucket{le="0.25"}[30m])) by (service_name)
          /
          sum(rate(http_server_duration_seconds_count[30m])) by (service_name)
        labels:
          sli: latency

      - record: sli:http_latency:ratio_rate1h
        expr: |
          sum(rate(http_server_duration_seconds_bucket{le="0.25"}[1h])) by (service_name)
          /
          sum(rate(http_server_duration_seconds_count[1h])) by (service_name)
        labels:
          sli: latency

      - record: sli:http_latency:ratio_rate6h
        expr: |
          sum(rate(http_server_duration_seconds_bucket{le="0.25"}[6h])) by (service_name)
          /
          sum(rate(http_server_duration_seconds_count[6h])) by (service_name)
        labels:
          sli: latency

      - record: sli:http_latency:ratio_rate1d
        expr: |
          sum(rate(http_server_duration_seconds_bucket{le="0.25"}[1d])) by (service_name)
          /
          sum(rate(http_server_duration_seconds_count[1d])) by (service_name)
        labels:
          sli: latency

      - record: sli:http_latency:ratio_rate3d
        expr: |
          sum(rate(http_server_duration_seconds_bucket{le="0.25"}[3d])) by (service_name)
          /
          sum(rate(http_server_duration_seconds_count[3d])) by (service_name)
        labels:
          sli: latency

      - record: sli:http_latency:ratio_rate30d
        expr: |
          sum(rate(http_server_duration_seconds_bucket{le="0.25"}[30d])) by (service_name)
          /
          sum(rate(http_server_duration_seconds_count[30d])) by (service_name)
        labels:
          sli: latency

Query a recording rule value:

curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=sli:http_availability:ratio_rate30d'

curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=sli:http_latency:ratio_rate1h'

Error Budget Calculation

Common SLO Targets and Allowed Downtime

SLO Target	Error Budget	Allowed Downtime (30 days)	Allowed Downtime (per day)
99.9%	0.1%	43.2 minutes	1.44 minutes
99.5%	0.5%	3.6 hours	7.2 minutes
99.0%	1.0%	7.2 hours	14.4 minutes
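The downtime figures in the table follow from multiplying the error budget by the length of the window. As a small sketch for other targets or window lengths (the function name allowed_downtime_minutes is hypothetical, and it treats the budget as minutes of complete outage):

```shell
# Hypothetical helper: minutes of full downtime an SLO allows over a window.
# Usage: allowed_downtime_minutes <slo_percent> <window_days>
allowed_downtime_minutes() {
  awk -v slo="$1" -v days="$2" \
    'BEGIN { printf "%.2f\n", (100 - slo) / 100 * days * 24 * 60 }'
}

allowed_downtime_minutes 99.9 30   # prints 43.20 (30-day window)
allowed_downtime_minutes 99.9 1    # prints 1.44  (single day)
```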

Remaining Error Budget

The remaining error budget tells you what fraction of your error budget is still available. A value of 1.0 means the full budget remains; 0.0 means the budget is exhausted; negative means you've exceeded it.

Formula: 1 - (1 - SLI) / (1 - SLO_target)

For a 99.9% SLO target using the 30-day availability SLI:

curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=1 - ((1 - sli:http_availability:ratio_rate30d) / (1 - 0.999))'

For a 99.5% SLO target:

curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=1 - ((1 - sli:http_availability:ratio_rate30d) / (1 - 0.995))'

For a 99.0% SLO target:

curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=1 - ((1 - sli:http_availability:ratio_rate30d) / (1 - 0.99))'
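To sanity-check the numbers Prometheus returns, the remaining-budget formula above can be evaluated offline. This is an illustrative sketch only; error_budget_remaining is a hypothetical name, and both arguments are fractions rather than percentages.

```shell
# Hypothetical helper implementing 1 - (1 - SLI) / (1 - SLO_target).
# Usage: error_budget_remaining <sli> <slo_target>, both as fractions.
error_budget_remaining() {
  awk -v sli="$1" -v slo="$2" \
    'BEGIN { printf "%.2f\n", 1 - (1 - sli) / (1 - slo) }'
}

error_budget_remaining 0.9995 0.999   # prints 0.50: half the budget is left
error_budget_remaining 0.998 0.999    # prints -1.00: budget overspent by 100%
```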

Error Budget Consumption Rate

The consumption rate shows how fast the error budget is being consumed. A value of 1.0 means the budget is being consumed at exactly the expected rate; values above 1.0 mean the budget is being consumed faster than sustainable.

curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=(1 - sli:http_availability:ratio_rate1h) / (1 - 0.999)'

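Error budget consumption rate over the last day (like the 1-hour query, this returns one value per service because the recording rules aggregate by service_name):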

curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=(1 - sli:http_availability:ratio_rate1d) / (1 - 0.999)'

Burn Rate Queries

Burn rate measures how fast you are consuming your error budget relative to the SLO. A burn rate of 1.0 means you will exactly exhaust the budget by the end of the SLO window. Higher values mean faster consumption.
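Since a burn rate of 1.0 uses up the budget exactly at the end of the SLO window, the time to exhaustion at a constant burn rate is the window length divided by the burn rate. A minimal sketch, assuming a full budget at the start and using the hypothetical name budget_exhaustion_days:

```shell
# Hypothetical helper: days until a full budget is gone at a constant burn rate.
# Usage: budget_exhaustion_days <burn_rate> <slo_window_days>
budget_exhaustion_days() {
  awk -v rate="$1" -v window="$2" \
    'BEGIN {
       if (rate <= 0) { print "never"; exit }
       printf "%.2f\n", window / rate
     }'
}

budget_exhaustion_days 14.4 30   # prints 2.08
budget_exhaustion_days 1 30      # prints 30.00
```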

Single-Window Burn Rate

Burn rate over a 1-hour window for a 99.9% SLO:

curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=(1 - sli:http_availability:ratio_rate1h) / (1 - 0.999)'

Burn rate over a 6-hour window:

curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=(1 - sli:http_availability:ratio_rate6h) / (1 - 0.999)'

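Multi-Window Burn Rate (Google SRE Workbook Pattern)

The multi-window approach requires two conditions to hold at the same time before alerting: the burn rate must exceed the threshold over both a shorter and a longer window. This reduces false positives from brief spikes, because a short burst rarely pushes the longer window over the threshold. The window pairs below are this skill's choice; the workbook's own example pairs each long window with a short window one twelfth of its length (1h with 5m, 6h with 30m, 3d with 6h).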

14.4x Fast Burn — 1h window / 6h window

Detects severe incidents that will exhaust the entire 30-day error budget in ~2 days. Both the 1-hour and 6-hour burn rates must exceed 14.4x:

1-hour burn rate:

curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=(1 - sli:http_availability:ratio_rate1h) / (1 - 0.999) > 14.4'

6-hour burn rate (confirmation window):

curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=(1 - sli:http_availability:ratio_rate6h) / (1 - 0.999) > 14.4'

1x Slow Burn — 3d window / 30d window

Detects slow, sustained degradation that will exhaust the error budget by the end of the SLO window. Both the 3-day and 30-day burn rates must exceed 1x:

3-day burn rate:

curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=(1 - sli:http_availability:ratio_rate3d) / (1 - 0.999) > 1'

30-day burn rate (confirmation window):

curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=(1 - sli:http_availability:ratio_rate30d) / (1 - 0.999) > 1'
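The two-condition logic can be reproduced outside Prometheus, for example to check a pair of SLI values by hand. The sketch below is illustrative only: multiwindow_burn_exceeds is a hypothetical name, and the SLI values in the usage lines are made-up inputs, not measurements.

```shell
# Hypothetical helper: exit 0 only if BOTH windows burn faster than the threshold.
# Usage: multiwindow_burn_exceeds <sli_window_a> <sli_window_b> <slo_target> <threshold>
multiwindow_burn_exceeds() {
  awk -v a="$1" -v b="$2" -v slo="$3" -v t="$4" \
    'BEGIN {
       budget = 1 - slo              # allowed error ratio
       burn_a = (1 - a) / budget     # burn rate in the first window
       burn_b = (1 - b) / budget     # burn rate in the second window
       exit !(burn_a > t && burn_b > t)
     }'
}

# Made-up inputs: burn rates of 20x and 16x against a 99.9% target.
multiwindow_burn_exceeds 0.98 0.984 0.999 14.4 && echo "fast burn: would alert"
# Made-up inputs: 20x in the first window but only 5x in the second.
multiwindow_burn_exceeds 0.98 0.995 0.999 14.4 || echo "not confirmed: no alert"
```

Requiring both windows is what the and operator does in the alerting rules; the function simply mirrors that for a single pair of values.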

Prometheus Alerting Rules for Burn Rate

Add these alerting rules to a rule file loaded by Prometheus to trigger alerts when burn rates exceed thresholds. They apply the multi-window pattern adapted from the Google SRE Workbook: each alert fires only when both of its windows exceed the burn rate threshold.

Availability Burn Rate Alerts

groups:
  - name: slo_burn_rate_alerts
    rules:
      - alert: SLOAvailabilityFastBurn
        expr: |
          (
            (1 - sli:http_availability:ratio_rate1h) / (1 - 0.999) > 14.4
          and
            (1 - sli:http_availability:ratio_rate6h) / (1 - 0.999) > 14.4
          )
        for: 2m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: "High availability burn rate detected for {{ $labels.service_name }}"
          description: "Service {{ $labels.service_name }} is consuming error budget at 14.4x the sustainable rate. At this rate, the 30-day budget will be exhausted in ~2 days."

      - alert: SLOAvailabilitySlowBurn
        expr: |
          (
            (1 - sli:http_availability:ratio_rate3d) / (1 - 0.999) > 1
          and
            (1 - sli:http_availability:ratio_rate30d) / (1 - 0.999) > 1
          )
        for: 1h
        labels:
          severity: warning
          slo: availability
        annotations:
          summary: "Sustained availability degradation for {{ $labels.service_name }}"
          description: "Service {{ $labels.service_name }} has a burn rate above 1x over 3 days, confirmed by the 30-day window. Error budget will be exhausted before the SLO window ends."

Latency Burn Rate Alerts

groups:
  - name: slo_latency_burn_rate_alerts
    rules:
      - alert: SLOLatencyFastBurn
        expr: |
          (
            (1 - sli:http_latency:ratio_rate1h) / (1 - 0.999) > 14.4
          and
            (1 - sli:http_latency:ratio_rate6h) / (1 - 0.999) > 14.4
          )
        for: 2m
        labels:
          severity: critical
          slo: latency
        annotations:
          summary: "High latency burn rate detected for {{ $labels.service_name }}"
          description: "Service {{ $labels.service_name }} latency SLI is degrading at 14.4x the sustainable rate."

      - alert: SLOLatencySlowBurn
        expr: |
          (
            (1 - sli:http_latency:ratio_rate3d) / (1 - 0.999) > 1
          and
            (1 - sli:http_latency:ratio_rate30d) / (1 - 0.999) > 1
          )
        for: 1h
        labels:
          severity: warning
          slo: latency
        annotations:
          summary: "Sustained latency degradation for {{ $labels.service_name }}"
          description: "Service {{ $labels.service_name }} latency SLI burn rate exceeds 1x over 3 days."

Query active alerts:

curl -s "$PROMETHEUS_ENDPOINT/api/v1/alerts"

Query alerting rules:

curl -s "$PROMETHEUS_ENDPOINT/api/v1/rules"

SLO Compliance Reporting

Current SLI Value

Query the current availability SLI over the 30-day window for all services:

curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=sli:http_availability:ratio_rate30d'

Query the current latency SLI over the 30-day window:

curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=sli:http_latency:ratio_rate30d'

Target Comparison

Check which services are meeting the 99.9% availability SLO:

curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=sli:http_availability:ratio_rate30d >= 0.999'

Check which services are violating the SLO:

curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=sli:http_availability:ratio_rate30d < 0.999'

Budget Remaining per Service

Remaining error budget for each service against a 99.9% target:

curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=1 - ((1 - sli:http_availability:ratio_rate30d) / (1 - 0.999))'

Burn Rate per Service

Current burn rate for each service over the last hour:

curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=(1 - sli:http_availability:ratio_rate1h) / (1 - 0.999)'

Current burn rate over the last day:

curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=(1 - sli:http_availability:ratio_rate1d) / (1 - 0.999)'
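The raw JSON from these queries can be condensed into a one-line-per-service report. The following is a sketch under a few assumptions: summarize_slo is a hypothetical name, python3 is available (as already assumed by the json.tool calls in this skill), and the response is a standard instant-query vector with a service_name label.

```shell
# Hypothetical helper: summarize a Prometheus instant-query response per service.
# Usage: <query response on stdin> | summarize_slo <slo_target_fraction>
summarize_slo() {
  python3 -c '
import json, sys

target = float(sys.argv[1])
# Each result entry looks like {"metric": {...labels...}, "value": [ts, "0.9995"]}.
for series in json.load(sys.stdin)["data"]["result"]:
    service = series["metric"].get("service_name", "unknown")
    sli = float(series["value"][1])
    remaining = 1 - (1 - sli) / (1 - target)
    status = "OK" if sli >= target else "VIOLATING"
    print(f"{service} sli={sli:.5f} budget_remaining={remaining:.2f} {status}")
' "$1"
}

# Example against the local endpoint (not run here):
# curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
#   --data-urlencode 'query=sli:http_availability:ratio_rate30d' | summarize_slo 0.999
```

Keeping the target as an argument lets the same helper report against 99.5% or 99.0% without changing the query.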

SLO Setup Workflow

Follow these steps to implement SLO monitoring for a service:

Step 1: Define SLIs

Choose the SLIs that matter for your service. Most services need at least availability and latency:

Availability SLI: ratio of non-5xx responses to total responses
Latency SLI: ratio of requests under a threshold (e.g., 250ms) to total requests

Verify the raw metrics exist in Prometheus:

curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=http_server_duration_seconds_count'

Step 2: Add Recording Rules

Add the recording rule groups from the Prometheus Recording Rules section to your Prometheus configuration. This pre-computes SLI values at all required time windows (5m, 30m, 1h, 6h, 1d, 3d, 30d).

Save the rules to a file (e.g., slo-rules.yml) and reference it in prometheus.yml:

rule_files:
  - "slo-rules.yml"

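Reload Prometheus to pick up the new rules. The reload endpoint only works when Prometheus is started with --web.enable-lifecycle; otherwise restart the container or send the process a SIGHUP: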

curl -s -X POST "$PROMETHEUS_ENDPOINT/-/reload"

Verify the recording rules are loaded:

curl -s "$PROMETHEUS_ENDPOINT/api/v1/rules" | python3 -m json.tool
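A syntax error in the rule file makes the reload fail, so it can help to validate the file first. The wrapper below is a sketch, assuming promtool (shipped with Prometheus) is on the PATH; check_rules_file and its exit codes 2 and 3 are hypothetical conventions, not part of promtool.

```shell
# Hypothetical wrapper: validate a rules file before asking Prometheus to reload.
# Returns 2 if the file is missing, 3 if promtool is not installed,
# otherwise promtool's own exit status.
check_rules_file() {
  local file="$1"
  if [ ! -f "$file" ]; then
    echo "rules file not found: $file" >&2
    return 2
  fi
  if ! command -v promtool >/dev/null 2>&1; then
    echo "promtool not found on PATH" >&2
    return 3
  fi
  promtool check rules "$file"
}

# Example: only reload when validation passes (not run here).
# check_rules_file slo-rules.yml && curl -s -X POST "$PROMETHEUS_ENDPOINT/-/reload"
```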

Step 3: Set Targets

Choose SLO targets based on your service requirements:

Service Tier	Availability Target	Latency Target (share of requests within threshold)
Critical (user-facing)	99.9%	99.9% within 250ms
Standard (internal)	99.5%	99.5% within 500ms
Best-effort (batch)	99.0%	99.0% within 2s

Step 4: Add Alerts

Add the burn rate alerting rules from the Prometheus Alerting Rules for Burn Rate section. Adjust the SLO target value in the expr field to match your chosen target.

Verify alerts are configured:

curl -s "$PROMETHEUS_ENDPOINT/api/v1/rules" | python3 -m json.tool
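Adjusting the target means replacing every (1 - 0.999) in the alert expressions. One way to do that consistently is a small text filter; this is a sketch only, set_slo_target is a hypothetical name, and it assumes the expressions use exactly the (1 - 0.999) spelling shown in this skill.

```shell
# Hypothetical filter: rewrite the hard-coded 99.9% target in rule text.
# Usage: set_slo_target <new_target_fraction> < rules-in.yml > rules-out.yml
set_slo_target() {
  # In basic regular expressions the parentheses are literal; the dot is escaped.
  sed "s/(1 - 0\.999)/(1 - $1)/g"
}

echo '(1 - sli:http_availability:ratio_rate1h) / (1 - 0.999) > 14.4' | set_slo_target 0.995
# prints: (1 - sli:http_availability:ratio_rate1h) / (1 - 0.995) > 14.4
```

The filter rewrites every matching expression in its input, so run it separately on the availability and latency groups if they need different targets.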

Step 5: Query Compliance

Run the compliance report queries from the SLO Compliance Reporting section to verify everything is working:

# Current SLI
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=sli:http_availability:ratio_rate30d'

# Budget remaining
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=1 - ((1 - sli:http_availability:ratio_rate30d) / (1 - 0.999))'

# Burn rate
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=(1 - sli:http_availability:ratio_rate1h) / (1 - 0.999)'

# Active alerts
curl -s "$PROMETHEUS_ENDPOINT/api/v1/alerts"

References

PPL Language Reference — Official PPL syntax documentation. Fetch this if queries fail due to OpenSearch version differences or new syntax.
Prometheus Querying Basics — PromQL syntax reference.

AWS Managed Service Variants

Amazon Managed Service for Prometheus (AMP) (SigV4)

For all PromQL queries in this skill, replace the local Prometheus endpoint with the AMP workspace endpoint and sign each request with AWS SigV4:

curl -s --aws-sigv4 "aws:amz:REGION:aps" \
  --user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" \
  'https://aps-workspaces.REGION.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/query' \
  --data-urlencode 'query=sli:http_availability:ratio_rate30d'

Endpoint format: https://aps-workspaces.REGION.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/query
Auth: --aws-sigv4 "aws:amz:REGION:aps" with --user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY"; with temporary credentials, also send -H "x-amz-security-token: $AWS_SESSION_TOKEN"
PromQL query syntax is identical between local Prometheus and Amazon Managed Prometheus; only the endpoint and authentication differ

Error budget query via AMP:

curl -s --aws-sigv4 "aws:amz:REGION:aps" \
  --user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" \
  'https://aps-workspaces.REGION.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/query' \
  --data-urlencode 'query=1 - ((1 - sli:http_availability:ratio_rate30d) / (1 - 0.999))'

Burn rate query via AMP:

curl -s --aws-sigv4 "aws:amz:REGION:aps" \
  --user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" \
  'https://aps-workspaces.REGION.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/query' \
  --data-urlencode 'query=(1 - sli:http_availability:ratio_rate1h) / (1 - 0.999)'

For Amazon Managed Prometheus, recording rules and alerting rules are managed via the AMP Rules Management API rather than local configuration files. Use awscurl or the AWS CLI to upload rule groups.