name: ltr-backup-out-of-sla description: Debug Azure SQL Database Long-Term Retention (LTR) backup out of SLA alerts for Sterling databases. Analyzes LTR policies, validates incidents, identifies root causes including full backup failures, LTR copy timeouts, and elastic pool issues. Required inputs from calling agent - kusto-cluster-uri, kusto-database, and database configuration variables from get-db-info skill.
Debug LTR Backup Out of SLA Alerts
Debug Long-Term Retention (LTR) backup out of SLA alerts for Azure SQL Database Sterling databases when backups do not happen by expected time.
Background
LTR backups are based on full backups. The Backup/Restore service LTR threads copy qualified full backups (according to customer policies) to LTR containers. LTR backups are expected to happen and be kept according to customers' set policies.
Important: This guide is only for Sterling databases. For VLDB databases, refer to TSG HSLTR0002.
Required Information
This skill requires the following inputs (should be obtained by the calling agent):
From User or ICM:
- Logical Server Name
- Logical Database Name
- Logical Database ID (GUID format)
From execute-kusto-query skill:
- kusto-cluster-uri
- kusto-database
- region
From get-db-info skill (optional, if available):
- physical_database_id
- service_level_objective
- All other database environment variables
Note: The calling agent should use the execute-kusto-query skill before invoking this skill.
Critical Pre-Mitigation Steps
๐ฉ BEFORE mitigating the incident:
- Make sure no LTR backup has been skipped or missed
- Backfill any missing backups if required
- Verify next_backup_time has been updated
๐ฉ Only mitigate if needed LTR backups have been backfilled/happened AND next_backup_time updated.
Workflow
1. Check if ICM is Still Valid
Query CMS using HTTPQueryTool or Ltr V2 overview.xts view:
SELECT TOP 10 weekly_retention, monthly_retention, yearly_retention, week_of_year,
dropped_time, logical_database_created_time, last_backup_time,
next_backup_time, state
FROM ltr_configurations
WHERE logical_database_id = '{LogicalDatabaseId}'
ICM is valid if ALL conditions are true:
a) dropped_time is null (database not dropped)
b) Retention policy exists: At least one of weekly_retention, monthly_retention, yearly_retention, or week_of_year must be non-null and not equal to PT0S
c) next_backup_time is before UTCNow (backup is overdue)
d) Based on policy, all LTR backups exist and no backup was skipped or missed
If incident does NOT meet all conditions:
- Mitigate as noise
- Mark 'RCA needed' so expert can investigate
If incident is valid:
- Continue to identify root cause
2. Identify Root Cause from Known Issues
The following are known causes for LTR backup out of SLA:
a) Expected full backup did not happen b) LTR logic intentionally skipped last full backup c) LTR copy timeout in TR d) SubscriptionId disabled (VLDB gap) e) LTR regression (reach out to ltrcommitters@microsoft.com)
Existing Umbrella Incidents:
- Incident 619230339: GeoDr secondary stuck at TLB due to CMK key access
- Incident 626552360: LTR backup copy timeout in TR
- Incident 628197655: LTR backup not happen in VLDB due to subscription disabled
3. Diagnose: Expected Full Backup Did Not Happen
Execute Kusto Query:
Tool: mcp_sqlops_query_kusto
Parameters:
- command: "kusto_query"
- intent: "Check full backup status for LTR analysis"
- parameters: {
"cluster-uri": "{kusto-cluster-uri}",
"database": "{kusto-database}",
"query": "MonBackup | where TIMESTAMP >= ago(5d) | where logical_database_id contains '{LogicalDatabaseId}' | where backup_type == 'Full' or details contains '[LTRV2]' | project TIMESTAMP, event_type, backup_type, details, sql_error_message, logical_database_name, logical_server_name | order by TIMESTAMP desc"
}
Common Issues:
Issue: Databases in Basic SLO - Resource Limitations
Cause: Memory/connections exhaust in Basic tier Mitigation: None available (limitation of Basic SLO)
Issue: Elastic Pool - Full Backup Scheduling Cannot Keep Up
Check if database's full backup has been delayed using Kusto query above.
Diagnosis (CSS can do on SAW):
- Use
Ltr V2 overview.xtsview, tab "Sterling DBs", column "updateslo_script" to identify the affected database
Issue: Restore HeaderOnly Timeout
Error: "The backup metadata is required to create the RecoveryPointMetadata. Check the exception thrown by Restore HeaderOnly."
Issue: Storage Limit Reached
Error: "Failed to flush the commit table to disk in dbid 9 due to error 1132"
Issue: FullBackupContainerName Metadata Missing
Symptoms: LTR backup fails because FullBackupContainerName entry missing in server-level backup storage container metadata.
Issue: "Could not allocate space for object" / "Failed to flush the commit table to disk"
Mitigation: Consult LTR committers (ICM #351989110)
Issue: GeoDR Seeding Caused Full Backup Failure
Error: "Backup, file manipulation operations (such as ALTER DATABASE ADD FILE) and encryption changes on a database must be serialized." Mitigation: Request assistance from GeoDR on-call.
Issue: "Authorization token not loaded for accessing container" or "403 Forbidden"
Mitigation: Refer to TSG LTR0014: SQLDB LTR copy failure alert.
3b. Diagnose: LTR Copy Started But Never Finished (Scenario B)
If MonBackup shows BACKUP_LTR_COPY_START but NOT BACKUP_LTR_COPY_END for >3 days, the LTR thread is likely hung.
4. Check for Skipped/Missed Backups and Backfill
Use Ltr V2 overview.xts view:
- Input logical_database_id in filter
- Review "LTR backups" tab to verify all expected backups exist per policy
If backups are missing, determine backfill need:
- Full backup taken <6 days after previous โ not considered for LTR (mitigate as noise + "RCA needed")
- Full backup taken while database was GeoDR secondary โ not considered for LTR (mitigate as noise + "RCA needed")
- Otherwise โ backfill required
If backfills needed:
- Do NOT mitigate incident until backfills complete
- Contact LTR committers for complex scenarios: pixia / tisingh / penzh (ltrcommitters@microsoft.com)
5. Verify LTR Copy Timeout (TR Issue)
Execute Kusto query to check for LTR copy timeout events:
Tool: mcp_sqlops_query_kusto
Parameters:
- command: "kusto_query"
- intent: "Check for LTR copy timeout events"
- parameters: {
"cluster-uri": "{kusto-cluster-uri}",
"database": "{kusto-database}",
"query": "MonBackup | where TIMESTAMP >= ago(5d) | where logical_database_id contains '{LogicalDatabaseId}' | where details contains 'LTR' and details contains 'timeout' | project TIMESTAMP, event_type, details, sql_error_message"
}
If timeout detected:
๐ฉ LTR Copy Timeout Detected
- Link to umbrella incident 626552360
- Escalate to backup/restore team if not resolved
Summary Output Format
After completing analysis, provide summary:
๐ LTR Backup SLA Analysis Summary
Database Information:
- Logical Server: {LogicalServerName}
- Logical Database: {LogicalDatabaseName}
- Logical Database ID: {LogicalDatabaseId}
LTR Policy:
- Weekly Retention: {weekly_retention}
- Monthly Retention: {monthly_retention}
- Yearly Retention: {yearly_retention}
- Next Backup Time: {next_backup_time}
Root Cause:
- {Identified issue category}
- {Specific diagnostic findings}
Recommended Actions: {Numbered list of mitigation steps}
Backfill Status:
- Required: {Yes/No}
- Completed: {Yes/No/Pending}
Related References
- references/queries.md โ Kusto query definitions (LTR100โLTR200)
- references/knowledge.md โ LTR terminology and concepts
- references/principles.md โ ICM validation and pre-mitigation checks
- references/report.md โ Output format templates
- references/prep/sources.md โ TSG source URLs
Contact LTR committers for complex scenarios: pixia / tisingh / penzh
LTR Policy Documentation: https://learn.microsoft.com/en-us/azure/azure-sql/database/long-term-retention-overview