name: monitoring-background-jobs description: Monitors CockroachDB background job health by identifying failed, paused, and long-running jobs using SHOW JOBS and SHOW AUTOMATIC JOBS. Surfaces schema changes, backups/restores, automatic statistics collection, and SQL stats compaction jobs without DB Console access. Use when investigating schema change delays, failed backups, or automatic job issues. compatibility: Requires SQL access with VIEWJOB privilege or CONTROLJOB role option for cluster-wide visibility. Uses SHOW JOBS and SHOW AUTOMATIC JOBS (production-safe). metadata: author: cockroachdb version: "1.0"
Monitoring Background Jobs
Monitors background job health by identifying failed, paused, and long-running jobs that are distinct from user queries. Uses SQL-only interfaces (SHOW JOBS and SHOW AUTOMATIC JOBS) to surface schema changes, backups/restores, automatic statistics collection, and SQL stats compaction without requiring DB Console access.
Prerequisites
- SQL connection with
VIEWJOBsystem privilege (read-only) orCONTROLJOBrole option (control) - Background jobs are excluded from
SHOW CLUSTER STATEMENTSand from statement statistics surfaced in the DB Console SQL Activity page
Related skills: triaging-live-sql-activity for live queries, profiling-statement-fingerprints for historical query analysis.
Key Interfaces
SHOW JOBS: User-initiated + automatic jobs (last 12h default; retention configurable via thejobs.retention_timecluster setting, default 14 days)SHOW AUTOMATIC JOBS: Automatic jobs only (AUTO CREATE STATS, SCHEMA CHANGE GC, etc.)
See job types reference and job states reference for complete catalogs.
Core Diagnostic Queries
Query 1: Failed Jobs (Last 12 Hours)
Identify jobs that failed with error messages:
-- Failed jobs in last 12 hours
WITH j AS (SHOW JOBS)
SELECT
job_id,
job_type,
description,
created,
finished,
now() - created AS total_duration,
error
FROM j
WHERE status = 'failed'
AND created > now() - INTERVAL '12 hours'
ORDER BY created DESC
LIMIT 50;
Key columns:
error: Failure reason (check for permission errors, disk space, network issues)description: Human-readable description of what the job was doingtotal_duration: How long the job ran before failing
Common failure patterns:
- Permission denied: User lacks required privileges
- Disk space: Backup destination full
- Network timeout: External storage unreachable
- Constraint violation: Restore conflicts with existing data
Query 2: Long-Running Jobs
Find jobs running longer than expected threshold:
-- Jobs running longer than 1 hour
WITH j AS (SHOW JOBS)
SELECT
job_id,
job_type,
description,
status,
running_status,
created,
now() - created AS running_for,
fraction_completed,
coordinator_id
FROM j
WHERE status = 'running'
AND created < now() - INTERVAL '1 hour'
ORDER BY created
LIMIT 50;
Key columns:
running_for: Total elapsed time since job startedfraction_completed: Progress estimate (0.0 to 1.0, NULL if unavailable)running_status: Sub-state details (e.g., "waiting for MVCC GC")
Customizable thresholds:
- Schema changes: 30 minutes to several hours (depends on table size)
- Backups: 1-6+ hours (depends on data volume)
- Automatic jobs: Usually < 30 minutes
Query 3: Paused Jobs
Identify jobs that are paused and may need attention:
-- Paused jobs needing resume
WITH j AS (SHOW JOBS)
SELECT
job_id,
job_type,
description,
created,
now() - created AS paused_for,
coordinator_id
FROM j
WHERE status = 'paused'
ORDER BY created
LIMIT 50;
Action required:
Resume with RESUME JOB <job_id> after verifying the pause reason.
Common reasons for paused jobs:
- Manual user pause for maintenance
- Resource constraints (cluster paused the job)
- Error requiring manual intervention
Query 4: Schema Changes Waiting for MVCC GC
Find SCHEMA CHANGE GC jobs waiting for garbage collection:
-- Schema change cleanup jobs waiting for GC
WITH j AS (SHOW JOBS)
SELECT
job_id,
job_type,
description,
created,
now() - created AS waiting_for,
running_status
FROM j
WHERE status = 'running'
AND job_type = 'SCHEMA CHANGE GC'
AND running_status LIKE '%waiting for MVCC GC%'
ORDER BY created
LIMIT 50;
Interpretation:
- Normal: SCHEMA CHANGE GC jobs wait for data to become garbage-collectable based on the zone-level
gc.ttlseconds(default 4 hours) - Expected duration: Up to
gc.ttlseconds+ some overhead - When to worry: Waiting > 2x
gc.ttlseconds(check the effective value withSHOW ZONE CONFIGURATION FOR ...against the affected table, database, or RANGE — zone configs cascade and may be overridden at any level)
Why this happens: After DROP TABLE/INDEX operations, CockroachDB must wait for all reads at older timestamps to complete before physically removing data. This prevents "time-travel" queries from failing.
See job states reference for detailed MVCC GC explanation.
Query 5: Automatic Job Health (24h Window)
Monitor automatic background jobs like statistics collection:
-- Automatic jobs in last 24 hours
SELECT
job_id,
job_type,
description,
status,
created,
finished,
COALESCE(finished, now()) - created AS duration
FROM [SHOW AUTOMATIC JOBS]
WHERE created > now() - INTERVAL '24 hours'
AND job_type IN ('AUTO CREATE STATS', 'AUTO SQL STATS COMPACTION')
ORDER BY created DESC
LIMIT 50;
Key job types:
AUTO CREATE STATS: Automatic table statistics refresh (critical for query optimizer)AUTO SQL STATS COMPACTION: Periodic cleanup of statement/transaction statistics tables
Health indicators:
- Healthy: Regular successful executions (every few hours)
- Unhealthy: No recent executions, or high failure rate
- Impact of failure: Stale statistics lead to poor query plans and slow queries
Query 6: Jobs by Type and Status
Aggregated view for pattern analysis:
-- Job distribution by type and status (last 24h)
WITH j AS (SHOW JOBS)
SELECT
job_type,
status,
COUNT(*) AS job_count,
MIN(created) AS oldest,
MAX(created) AS newest
FROM j
WHERE created > now() - INTERVAL '24 hours'
GROUP BY job_type, status
ORDER BY job_type, status;
Use case:
- Identify patterns (e.g., all BACKUP jobs failing, multiple schema changes stuck)
- Spot anomalies (e.g., unusual job type volume)
- Track job success rates by type
Query 7: Backup and Restore Progress
Track progress of backup/restore jobs:
-- Active backup/restore jobs with progress
WITH j AS (SHOW JOBS)
SELECT
job_id,
job_type,
description,
created,
now() - created AS running_for,
ROUND(COALESCE(fraction_completed, 0) * 100, 2) AS percent_complete,
CASE
WHEN fraction_completed > 0 AND fraction_completed < 1 THEN
((now() - created) / fraction_completed) - (now() - created)
ELSE NULL
END AS estimated_time_remaining,
running_status
FROM j
WHERE status = 'running'
AND job_type IN ('BACKUP', 'RESTORE')
ORDER BY created
LIMIT 50;
Key columns:
percent_complete: Progress percentage (0-100)estimated_time_remaining: Rough estimate based on current progress raterunning_status: Detailed status (e.g., "performing backup to s3://...")
Note: fraction_completed may be NULL for some job types or early in execution.
Common Workflows
Workflow 1: Schema Change Stuck Investigation
Scenario: User reports ALTER TABLE or CREATE INDEX appears stuck.
Check for running schema changes:
WITH j AS (SHOW JOBS) SELECT job_id, description, created, now() - created AS running_for, fraction_completed, running_status FROM j WHERE status = 'running' AND job_type IN ('SCHEMA CHANGE', 'NEW SCHEMA CHANGE') ORDER BY created;Identify MVCC GC waits:
-- Use Query 4 to find "waiting for MVCC GC" jobsInterpret results:
- If
running_status= "waiting for MVCC GC": Normal for post-DROP cleanup (wait up togc.ttlseconds) - If long-running with low
fraction_completed: Check for contention, large table size, or resource constraints - If failed: Check
errorcolumn for specific failure reason
- If
Next steps:
- MVCC GC wait: Verify the effective
gc.ttlsecondswithSHOW ZONE CONFIGURATION FOR TABLE/DATABASE/RANGE ...against the affected object and wait - Resource constraints: Check cluster CPU/memory usage
- Failed job: Address error (permissions, constraints) and retry operation
- MVCC GC wait: Verify the effective
Workflow 2: Failed Backup Triage
Scenario: Scheduled backup job failed.
Find recent failed backups:
-- Use Query 1 filtered for BACKUP job type WITH j AS (SHOW JOBS) SELECT job_id, description, created, finished, error FROM j WHERE status = 'failed' AND job_type = 'BACKUP' AND created > now() - INTERVAL '24 hours' ORDER BY created DESC;Analyze error messages:
- "permission denied": Check external storage credentials
- "timeout": Network connectivity to backup destination
- "no space left": Destination storage full
- "connection refused": External storage endpoint unreachable
Verify backup destination:
-- Check SHOW BACKUP for successful backups to same destination SHOW BACKUP 's3://bucket/path';Remediate and retry:
- Fix underlying issue (credentials, storage, network)
- Re-run backup command
- Monitor with Query 7 for progress
Workflow 3: Automatic Job Health Check
Scenario: Proactive monitoring of automatic background jobs.
Check AUTO CREATE STATS frequency:
-- Use Query 5 to see recent automatic statistics jobs SELECT job_type, status, COUNT(*) AS job_count, MAX(created) AS most_recent FROM [SHOW AUTOMATIC JOBS] WHERE created > now() - INTERVAL '24 hours' AND job_type = 'AUTO CREATE STATS' GROUP BY job_type, status;Expected pattern:
- Multiple successful AUTO CREATE STATS jobs per day (depends on table update frequency)
- Regular AUTO SQL STATS COMPACTION (typically once per hour)
Warning signs:
- No AUTO CREATE STATS in last 24h: Statistics collection may be disabled
- High failure rate: Check cluster resource constraints or permission issues
- No AUTO SQL STATS COMPACTION: Stats table may grow unbounded
Verify settings:
SHOW CLUSTER SETTING sql.stats.automatic_collection.enabled; -- Should be true SHOW CLUSTER SETTING sql.stats.automatic_collection.min_stale_rows;
Workflow 4: Long-Running Job Monitoring
Scenario: Track progress of expected long-running operations.
Identify long-running jobs:
-- Use Query 2 with custom threshold WITH j AS (SHOW JOBS) SELECT job_id, job_type, description, now() - created AS running_for, fraction_completed FROM j WHERE status = 'running' AND created < now() - INTERVAL '30 minutes' ORDER BY created;Monitor progress over time:
-- Re-run every 10-15 minutes, track fraction_completed changes -- Example: 0.25 → 0.40 → 0.55 indicates steady progressEstimate completion:
-- Use Query 7 for backup/restore jobs with time estimatesDecide on action:
- Steady progress: Continue monitoring
- Stalled progress (fraction_completed not increasing): Investigate with triaging-live-sql-activity
- Failed: Use Query 1 to check error
Safety Considerations
Read-only operations (all diagnostic queries):
All SHOW JOBS and SHOW AUTOMATIC JOBS queries are read-only and safe to run in production. No performance impact on cluster operations.
Job control operations (opt-in):
CAUTION: Pausing or canceling jobs can have data integrity implications
Only proceed with job control if:
- You have
CONTROLJOBrole option - You understand the implications (e.g., canceling a schema change mid-execution may require manual cleanup)
- You have authorization to interrupt cluster operations
- You've verified the job is truly problematic (not just slow)
Job control commands:
-- Pause a running job (can be resumed later)
PAUSE JOB <job_id>;
-- Resume a paused job
RESUME JOB <job_id>;
-- Cancel a job (terminal - cannot be resumed)
CANCEL JOB <job_id>;
Risks by job type:
- SCHEMA CHANGE: Canceling may leave schema in inconsistent state; prefer PAUSE and investigation
- BACKUP: Canceling is safe (can retry); pausing is better for temporary issues
- RESTORE: Canceling may leave database partially restored; requires cleanup
- AUTO CREATE STATS: Canceling is safe (will retry later automatically)
Best practice: Focus on monitoring and diagnosis; only use control operations when explicitly required and authorized.
See permissions reference for CONTROLJOB role option setup.
Troubleshooting
| Issue | Cause | Fix |
|---|---|---|
SHOW JOBS returns empty |
No jobs in last 12h, or insufficient privileges | Grant VIEWJOB privilege; verify cluster has recent job activity |
| "waiting for MVCC GC" for many hours | Normal behavior for SCHEMA CHANGE GC after DROP operations | Wait up to gc.ttlseconds (default 4h); check the effective value with SHOW ZONE CONFIGURATION FOR ... against the affected table/database/range |
| Can't pause/resume job: "permission denied" | Missing CONTROLJOB role option |
Use ALTER ROLE <username> WITH CONTROLJOB (not GRANT SYSTEM) |
Job stuck at same fraction_completed |
Job may be processing large batch, or actually stuck | Wait 15-30 min and re-check; if no change, investigate with live query triage |
| No AUTO CREATE STATS jobs | Automatic collection disabled | Check sql.stats.automatic_collection.enabled = true |
SHOW AUTOMATIC JOBS shows old jobs only |
Need to filter by time window | Add WHERE created > now() - INTERVAL '24 hours' |
| Failed job with "schema change GC" error | Expected for post-DROP cleanup failures | Usually safe to ignore; job will retry automatically |
| Job error: "concurrent schema change" | Multiple schema changes on same table | Wait for first schema change to complete, then retry |
Key Considerations
- MVCC GC waiting: Normal for post-DROP cleanup; duration tied to the zone-level
gc.ttlseconds(default 4h) - Automatic job health: Regular AUTO CREATE STATS is critical for query optimizer performance
- Job control: PAUSE is safer than CANCEL; some cancellations require manual cleanup
- Progress:
fraction_completedmay be NULL for some job types
References
Skill references:
Related skills:
- triaging-live-sql-activity - For live query monitoring (job-executed statements may not appear)
- profiling-statement-fingerprints - For historical query analysis (background jobs excluded)
Official CockroachDB Documentation: