monitoring-background-jobs - SKILL.md Agent Skill

name: monitoring-background-jobs description: Monitors CockroachDB background job health by identifying failed, paused, and long-running jobs using SHOW JOBS and SHOW AUTOMATIC JOBS. Surfaces schema changes, backups/restores, automatic statistics collection, and SQL stats compaction jobs without DB Console access. Use when investigating schema change delays, failed backups, or automatic job issues. compatibility: Requires SQL access with VIEWJOB privilege or CONTROLJOB role option for cluster-wide visibility. Uses SHOW JOBS and SHOW AUTOMATIC JOBS (production-safe). metadata: author: cockroachdb version: "1.0"

Monitoring Background Jobs

Monitors background job health by identifying failed, paused, and long-running jobs that are distinct from user queries. Uses SQL-only interfaces (SHOW JOBS and SHOW AUTOMATIC JOBS) to surface schema changes, backups/restores, automatic statistics collection, and SQL stats compaction without requiring DB Console access.

Prerequisites

SQL connection with VIEWJOB system privilege (read-only) or CONTROLJOB role option (control)
Background jobs are excluded from SHOW CLUSTER STATEMENTS and from statement statistics surfaced in the DB Console SQL Activity page

Related skills: triaging-live-sql-activity for live queries, profiling-statement-fingerprints for historical query analysis.

Key Interfaces

SHOW JOBS: User-initiated + automatic jobs (last 12h default; retention configurable via the jobs.retention_time cluster setting, default 14 days)
SHOW AUTOMATIC JOBS: Automatic jobs only (AUTO CREATE STATS, SCHEMA CHANGE GC, etc.)

See job types reference and job states reference for complete catalogs.

Core Diagnostic Queries

Query 1: Failed Jobs (Last 12 Hours)

Identify jobs that failed with error messages:

-- Failed jobs in last 12 hours
WITH j AS (SHOW JOBS)
SELECT
  job_id,
  job_type,
  description,
  created,
  finished,
  now() - created AS total_duration,
  error
FROM j
WHERE status = 'failed'
  AND created > now() - INTERVAL '12 hours'
ORDER BY created DESC
LIMIT 50;

Key columns:

error: Failure reason (check for permission errors, disk space, network issues)
description: Human-readable description of what the job was doing
total_duration: How long the job ran before failing

Common failure patterns:

Permission denied: User lacks required privileges
Disk space: Backup destination full
Network timeout: External storage unreachable
Constraint violation: Restore conflicts with existing data

Query 2: Long-Running Jobs

Find jobs running longer than expected threshold:

-- Jobs running longer than 1 hour
WITH j AS (SHOW JOBS)
SELECT
  job_id,
  job_type,
  description,
  status,
  running_status,
  created,
  now() - created AS running_for,
  fraction_completed,
  coordinator_id
FROM j
WHERE status = 'running'
  AND created < now() - INTERVAL '1 hour'
ORDER BY created
LIMIT 50;

Key columns:

running_for: Total elapsed time since job started
fraction_completed: Progress estimate (0.0 to 1.0, NULL if unavailable)
running_status: Sub-state details (e.g., "waiting for MVCC GC")

Customizable thresholds:

Schema changes: 30 minutes to several hours (depends on table size)
Backups: 1-6+ hours (depends on data volume)
Automatic jobs: Usually < 30 minutes

Query 3: Paused Jobs

Identify jobs that are paused and may need attention:

-- Paused jobs needing resume
WITH j AS (SHOW JOBS)
SELECT
  job_id,
  job_type,
  description,
  created,
  now() - created AS paused_for,
  coordinator_id
FROM j
WHERE status = 'paused'
ORDER BY created
LIMIT 50;

Action required: Resume with RESUME JOB <job_id> after verifying the pause reason.

Common reasons for paused jobs:

Manual user pause for maintenance
Resource constraints (cluster paused the job)
Error requiring manual intervention

Query 4: Schema Changes Waiting for MVCC GC

Find SCHEMA CHANGE GC jobs waiting for garbage collection:

-- Schema change cleanup jobs waiting for GC
WITH j AS (SHOW JOBS)
SELECT
  job_id,
  job_type,
  description,
  created,
  now() - created AS waiting_for,
  running_status
FROM j
WHERE status = 'running'
  AND job_type = 'SCHEMA CHANGE GC'
  AND running_status LIKE '%waiting for MVCC GC%'
ORDER BY created
LIMIT 50;

Interpretation:

Normal: SCHEMA CHANGE GC jobs wait for data to become garbage-collectable based on the zone-level gc.ttlseconds (default 4 hours)
Expected duration: Up to gc.ttlseconds + some overhead
When to worry: Waiting > 2x gc.ttlseconds (check the effective value with SHOW ZONE CONFIGURATION FOR ... against the affected table, database, or RANGE — zone configs cascade and may be overridden at any level)

Why this happens: After DROP TABLE/INDEX operations, CockroachDB must wait for all reads at older timestamps to complete before physically removing data. This prevents "time-travel" queries from failing.

See job states reference for detailed MVCC GC explanation.

Query 5: Automatic Job Health (24h Window)

Monitor automatic background jobs like statistics collection:

-- Automatic jobs in last 24 hours
SELECT
  job_id,
  job_type,
  description,
  status,
  created,
  finished,
  COALESCE(finished, now()) - created AS duration
FROM [SHOW AUTOMATIC JOBS]
WHERE created > now() - INTERVAL '24 hours'
  AND job_type IN ('AUTO CREATE STATS', 'AUTO SQL STATS COMPACTION')
ORDER BY created DESC
LIMIT 50;

Key job types:

AUTO CREATE STATS: Automatic table statistics refresh (critical for query optimizer)
AUTO SQL STATS COMPACTION: Periodic cleanup of statement/transaction statistics tables

Health indicators:

Healthy: Regular successful executions (every few hours)
Unhealthy: No recent executions, or high failure rate
Impact of failure: Stale statistics lead to poor query plans and slow queries

Query 6: Jobs by Type and Status

Aggregated view for pattern analysis:

-- Job distribution by type and status (last 24h)
WITH j AS (SHOW JOBS)
SELECT
  job_type,
  status,
  COUNT(*) AS job_count,
  MIN(created) AS oldest,
  MAX(created) AS newest
FROM j
WHERE created > now() - INTERVAL '24 hours'
GROUP BY job_type, status
ORDER BY job_type, status;

Use case:

Identify patterns (e.g., all BACKUP jobs failing, multiple schema changes stuck)
Spot anomalies (e.g., unusual job type volume)
Track job success rates by type

Query 7: Backup and Restore Progress

Track progress of backup/restore jobs:

-- Active backup/restore jobs with progress
WITH j AS (SHOW JOBS)
SELECT
  job_id,
  job_type,
  description,
  created,
  now() - created AS running_for,
  ROUND(COALESCE(fraction_completed, 0) * 100, 2) AS percent_complete,
  CASE
    WHEN fraction_completed > 0 AND fraction_completed < 1 THEN
      ((now() - created) / fraction_completed) - (now() - created)
    ELSE NULL
  END AS estimated_time_remaining,
  running_status
FROM j
WHERE status = 'running'
  AND job_type IN ('BACKUP', 'RESTORE')
ORDER BY created
LIMIT 50;

Key columns:

percent_complete: Progress percentage (0-100)
estimated_time_remaining: Rough estimate based on current progress rate
running_status: Detailed status (e.g., "performing backup to s3://...")

Note: fraction_completed may be NULL for some job types or early in execution.

Common Workflows

Workflow 1: Schema Change Stuck Investigation

Scenario: User reports ALTER TABLE or CREATE INDEX appears stuck.

Check for running schema changes:

WITH j AS (SHOW JOBS)
SELECT job_id, description, created, now() - created AS running_for,
       fraction_completed, running_status
FROM j
WHERE status = 'running'
  AND job_type IN ('SCHEMA CHANGE', 'NEW SCHEMA CHANGE')
ORDER BY created;

Identify MVCC GC waits:

-- Use Query 4 to find "waiting for MVCC GC" jobs

Interpret results:
- If running_status = "waiting for MVCC GC": Normal for post-DROP cleanup (wait up to gc.ttlseconds)
- If long-running with low fraction_completed: Check for contention, large table size, or resource constraints
- If failed: Check error column for specific failure reason
Next steps:
- MVCC GC wait: Verify the effective gc.ttlseconds with SHOW ZONE CONFIGURATION FOR TABLE/DATABASE/RANGE ... against the affected object and wait
- Resource constraints: Check cluster CPU/memory usage
- Failed job: Address error (permissions, constraints) and retry operation

Workflow 2: Failed Backup Triage

Scenario: Scheduled backup job failed.

Find recent failed backups:

-- Use Query 1 filtered for BACKUP job type
WITH j AS (SHOW JOBS)
SELECT job_id, description, created, finished, error
FROM j
WHERE status = 'failed'
  AND job_type = 'BACKUP'
  AND created > now() - INTERVAL '24 hours'
ORDER BY created DESC;

Analyze error messages:
- "permission denied": Check external storage credentials
- "timeout": Network connectivity to backup destination
- "no space left": Destination storage full
- "connection refused": External storage endpoint unreachable

Verify backup destination:

-- Check SHOW BACKUP for successful backups to same destination
SHOW BACKUP 's3://bucket/path';

Remediate and retry:
- Fix underlying issue (credentials, storage, network)
- Re-run backup command
- Monitor with Query 7 for progress

Workflow 3: Automatic Job Health Check

Scenario: Proactive monitoring of automatic background jobs.

Check AUTO CREATE STATS frequency:

-- Use Query 5 to see recent automatic statistics jobs
SELECT job_type, status, COUNT(*) AS job_count,
       MAX(created) AS most_recent
FROM [SHOW AUTOMATIC JOBS]
WHERE created > now() - INTERVAL '24 hours'
  AND job_type = 'AUTO CREATE STATS'
GROUP BY job_type, status;

Expected pattern:
- Multiple successful AUTO CREATE STATS jobs per day (depends on table update frequency)
- Regular AUTO SQL STATS COMPACTION (typically once per hour)
Warning signs:
- No AUTO CREATE STATS in last 24h: Statistics collection may be disabled
- High failure rate: Check cluster resource constraints or permission issues
- No AUTO SQL STATS COMPACTION: Stats table may grow unbounded

Verify settings:

SHOW CLUSTER SETTING sql.stats.automatic_collection.enabled;  -- Should be true
SHOW CLUSTER SETTING sql.stats.automatic_collection.min_stale_rows;

Workflow 4: Long-Running Job Monitoring

Scenario: Track progress of expected long-running operations.

Identify long-running jobs:

-- Use Query 2 with custom threshold
WITH j AS (SHOW JOBS)
SELECT job_id, job_type, description,
       now() - created AS running_for,
       fraction_completed
FROM j
WHERE status = 'running'
  AND created < now() - INTERVAL '30 minutes'
ORDER BY created;

Monitor progress over time:

-- Re-run every 10-15 minutes, track fraction_completed changes
-- Example: 0.25 → 0.40 → 0.55 indicates steady progress

Estimate completion:

-- Use Query 7 for backup/restore jobs with time estimates

Decide on action:
- Steady progress: Continue monitoring
- Stalled progress (fraction_completed not increasing): Investigate with triaging-live-sql-activity
- Failed: Use Query 1 to check error

Safety Considerations

Read-only operations (all diagnostic queries): All SHOW JOBS and SHOW AUTOMATIC JOBS queries are read-only and safe to run in production. No performance impact on cluster operations.

Job control operations (opt-in):

CAUTION: Pausing or canceling jobs can have data integrity implications

Only proceed with job control if:

You have CONTROLJOB role option
You understand the implications (e.g., canceling a schema change mid-execution may require manual cleanup)
You have authorization to interrupt cluster operations
You've verified the job is truly problematic (not just slow)

Job control commands:

-- Pause a running job (can be resumed later)
PAUSE JOB <job_id>;

-- Resume a paused job
RESUME JOB <job_id>;

-- Cancel a job (terminal - cannot be resumed)
CANCEL JOB <job_id>;

Risks by job type:

SCHEMA CHANGE: Canceling may leave schema in inconsistent state; prefer PAUSE and investigation
BACKUP: Canceling is safe (can retry); pausing is better for temporary issues
RESTORE: Canceling may leave database partially restored; requires cleanup
AUTO CREATE STATS: Canceling is safe (will retry later automatically)

Best practice: Focus on monitoring and diagnosis; only use control operations when explicitly required and authorized.

See permissions reference for CONTROLJOB role option setup.

Troubleshooting

Issue	Cause	Fix
`SHOW JOBS` returns empty	No jobs in last 12h, or insufficient privileges	Grant `VIEWJOB` privilege; verify cluster has recent job activity
"waiting for MVCC GC" for many hours	Normal behavior for SCHEMA CHANGE GC after DROP operations	Wait up to `gc.ttlseconds` (default 4h); check the effective value with `SHOW ZONE CONFIGURATION FOR ...` against the affected table/database/range
Can't pause/resume job: "permission denied"	Missing `CONTROLJOB` role option	Use `ALTER ROLE <username> WITH CONTROLJOB` (not GRANT SYSTEM)
Job stuck at same `fraction_completed`	Job may be processing large batch, or actually stuck	Wait 15-30 min and re-check; if no change, investigate with live query triage
No AUTO CREATE STATS jobs	Automatic collection disabled	Check `sql.stats.automatic_collection.enabled = true`
`SHOW AUTOMATIC JOBS` shows old jobs only	Need to filter by time window	Add `WHERE created > now() - INTERVAL '24 hours'`
Failed job with "schema change GC" error	Expected for post-DROP cleanup failures	Usually safe to ignore; job will retry automatically
Job error: "concurrent schema change"	Multiple schema changes on same table	Wait for first schema change to complete, then retry

Key Considerations

MVCC GC waiting: Normal for post-DROP cleanup; duration tied to the zone-level gc.ttlseconds (default 4h)
Automatic job health: Regular AUTO CREATE STATS is critical for query optimizer performance
Job control: PAUSE is safer than CANCEL; some cancellations require manual cleanup
Progress: fraction_completed may be NULL for some job types

References

Skill references:

Related skills:

triaging-live-sql-activity - For live query monitoring (job-executed statements may not appear)
profiling-statement-fingerprints - For historical query analysis (background jobs excluded)

Official CockroachDB Documentation: