name: debug-deploy
description: Debug GitHub Actions workflow failures and Terraform errors. Use when deployment failed, Terraform state lock, CI/CD pipeline errors, or troubleshooting deploy.yml.
Debug Deploy Workflow Skill
This skill helps diagnose and fix issues with the GitHub Actions deploy workflow for the Learn to Cloud app.
When to Use
- User mentions a failed deployment or CI/CD failure
- Terraform state lock errors
- Authentication or authorization failures
- Azure resource issues during deployment
- Workflow debugging or troubleshooting
Debugging Process
Step 1: Check Recent Workflow Runs
gh run list --workflow=deploy.yml --limit 5
Step 2: View Failed Logs
gh run view <run-id> --log-failed
Or use the debug script in this skill's folder:
.github/skills/debug-deploy/scripts/debug-deploy.sh logs
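Steps 1 and 2 can be chained so the run ID does not have to be copied by hand. The following is a minimal sketch, assuming an authenticated `gh` CLI recent enough to support `--status`, `--json`, and `--jq` on `gh run list`. The function names `latest_failed_run` and `show_latest_failed_logs` are hypothetical and are not taken from `debug-deploy.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of debug-deploy.sh): print the ID of the
# most recent failed run of deploy.yml, or nothing if there is none.
latest_failed_run() {
  gh run list --workflow=deploy.yml --status failure --limit 1 \
    --json databaseId --jq '.[0].databaseId // empty'
}

# Show only the failed steps' logs for that run.
show_latest_failed_logs() {
  local run_id
  run_id="$(latest_failed_run)"
  if [ -z "$run_id" ]; then
    echo "No failed deploy.yml runs found" >&2
    return 1
  fi
  gh run view "$run_id" --log-failed
}
```

Sourcing the snippet and calling `show_latest_failed_logs` prints the same output as running `gh run view <run-id> --log-failed` by hand.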
Step 3: Identify the Issue
Look for these common patterns in the logs:
Terraform State Lock
Pattern: Error acquiring the state lock or state blob is already locked
Cause: Previous workflow was cancelled mid-execution, leaving the state locked.
Fix:
- Extract the Lock ID from the error (looks like efd4cede-d5a2-61c3-31db-462852989510)
- Run: cd infra && terraform force-unlock -force <lock-id>
- Re-run the workflow: gh run rerun <run-id>
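Pulling the Lock ID out of the log can be scripted. This sketch assumes the lock error prints an `ID:` line followed by a UUID, as in Terraform's usual "Lock Info" output. If your logs are laid out differently, adjust the pattern. `extract_lock_id` is a hypothetical name, not a function from `debug-deploy.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read log text on stdin and print the first UUID that
# follows an "ID:" label (assumed to be the Terraform lock ID).
extract_lock_id() {
  local uuid_re='[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'
  grep -E "(^|[[:space:]])ID:[[:space:]]+${uuid_re}" \
    | grep -oE "$uuid_re" \
    | head -n 1
}

# Usage sketch (commented out because it needs gh and terraform):
#   lock_id="$(gh run view <run-id> --log-failed | extract_lock_id)"
#   (cd infra && terraform force-unlock -force "$lock_id")
```

Check the extracted ID against the error message before unlocking, since `force-unlock -force` skips Terraform's confirmation prompt.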
Authentication Failures
Pattern: AuthorizationFailed, AADSTS, unauthorized
Fix: Check the OIDC deployment configuration: AZURE_CLIENT_ID and AZURE_TENANT_ID secrets, plus the AZURE_SUBSCRIPTION_ID repository variable. The federated credential or Azure RBAC assignment may need updating.
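A quick first check is whether the expected secret and variable names exist in the repository at all. The sketch below assumes `gh secret list` and `gh variable list` print one entry per line with the name in the first column. `missing_names` is a hypothetical helper. It only confirms that the names are present and cannot validate their values or the federated credential.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a listing on stdin (name in the first column)
# and print each expected name that is absent from it.
missing_names() {
  local listing name
  listing="$(awk '{print $1}')"
  for name in "$@"; do
    printf '%s\n' "$listing" | grep -qxF -- "$name" || printf '%s\n' "$name"
  done
}

# Usage sketch (requires an authenticated gh CLI):
#   gh secret list   | missing_names AZURE_CLIENT_ID AZURE_TENANT_ID
#   gh variable list | missing_names AZURE_SUBSCRIPTION_ID
```

Empty output means every expected name was found.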
Resource Not Found
Pattern: ResourceNotFound or does not exist
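Fix: The resource was deleted outside Terraform. Run terraform refresh (deprecated in recent Terraform releases in favor of terraform apply -refresh-only) to reconcile state, or re-import the resource.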
Azure Quota Exceeded
Pattern: QuotaExceeded
Fix: Request a quota increase in the Azure portal or clean up unused resources.
Migration Job Failure
Pattern: The Run database migrations step fails, /ready returns 503 after deploy, or Alembic reports a database error.
Cause: The Azure Container Apps migration job failed before the API image was updated. Common causes are missing PostgreSQL role mapping, migration SQL errors, or job image/env override issues.
Fix:
- Inspect the failed workflow step and the ACA Job logs.
- Verify migration_identity_principal_id is mapped to the Terraform migration_postgres_role output in PostgreSQL.
- Confirm az containerapp job start passes the SHA image, alembic upgrade head, DB env vars, AZURE_CLIENT_ID, and --registry-identity.
- Check that alembic/env.py still uses psycopg2 and acquires the advisory lock before migrations.
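To confirm the `/ready` symptom, or to verify recovery after a fix, the endpoint can be polled from a shell. This is a sketch only. The base URL is supplied by the caller because the source does not give the app's address. `ready_status` and `wait_until_ready` are hypothetical names, and the attempt count and delay are arbitrary defaults.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the HTTP status code returned by <base-url>/ready.
ready_status() {
  curl --silent --output /dev/null --max-time 10 \
    --write-out '%{http_code}' "${1%/}/ready"
}

# Poll until /ready returns 200, giving up after a fixed number of attempts.
# Arguments: base URL, attempts (default 10), delay in seconds (default 5).
wait_until_ready() {
  local base_url="$1" attempts="${2:-10}" delay="${3:-5}" i=1 code
  while [ "$i" -le "$attempts" ]; do
    code="$(ready_status "$base_url")"
    echo "attempt $i: /ready returned $code"
    [ "$code" = "200" ] && return 0
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

A persistent 503 after the workflow finishes points back to the migration job checks listed above.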
Test Failures
Pattern: FAILED, pytest, AssertionError
Fix: Run tests locally: (cd api && uv run pytest tests/ -v) and (cd packages/learn-to-cloud-shared && uv run pytest tests/ -v)
Lint Failures
Pattern: ruff, lint error
Fix: Run linter locally: (cd api && uv run ruff check . ../packages/learn-to-cloud-shared)
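The patterns listed in Step 3 can be checked in one pass over a failed log. The following is an illustrative sketch, not the detection logic of `debug-deploy.sh`, which the source does not show. `classify_log`, `_has`, and the label names are hypothetical. Matching the step name `Run database migrations` assumes step names appear in the `--log-failed` output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the text in $1 matches the extended regex in $2.
_has() { printf '%s\n' "$1" | grep -qE -- "$2"; }

# Hypothetical classifier: read a log on stdin and print one label per
# matched Step 3 pattern (a log may match several).
classify_log() {
  local log
  log="$(cat)"
  _has "$log" 'Error acquiring the state lock|state blob is already locked' && echo "terraform-state-lock"
  _has "$log" 'AuthorizationFailed|AADSTS|unauthorized' && echo "authentication"
  _has "$log" 'ResourceNotFound|does not exist' && echo "resource-not-found"
  _has "$log" 'QuotaExceeded' && echo "quota-exceeded"
  _has "$log" 'Run database migrations' && echo "migration-job"
  _has "$log" 'FAILED|pytest|AssertionError' && echo "test-failure"
  _has "$log" 'ruff|lint error' && echo "lint-failure"
  return 0
}

# Usage sketch (requires gh):
#   gh run view <run-id> --log-failed | classify_log
```

Treat the labels as hints for where to look. Broad tokens such as `pytest` or `ruff` match any log line that mentions the tool.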
Step 4: Fix and Re-run
After fixing the issue:
gh run rerun <run-id>
Then watch the progress:
gh run watch <run-id>
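The re-run and watch commands can be combined into one step. Note that this sketch swaps in `gh run rerun --failed`, which re-runs only the failed jobs, whereas the plain `gh run rerun <run-id>` above re-runs the whole workflow. `rerun_and_watch` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: re-run only the failed jobs of a run, then follow it.
rerun_and_watch() {
  local run_id="$1"
  gh run rerun "$run_id" --failed || return 1
  # --exit-status makes gh exit non-zero if the run ends in failure.
  gh run watch "$run_id" --exit-status
}

# Usage sketch (requires gh):
#   rerun_and_watch <run-id> && echo "deploy succeeded"
```

Drop `--failed` if the fix touched something an earlier, successful job depends on.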
Quick Commands Reference
| Command | Description |
|---|---|
| ./.github/skills/debug-deploy/scripts/debug-deploy.sh status | Show recent workflow runs |
| ./.github/skills/debug-deploy/scripts/debug-deploy.sh logs | View and analyze failed logs |
| ./.github/skills/debug-deploy/scripts/debug-deploy.sh logs <id> | View specific run's failed logs |
| ./.github/skills/debug-deploy/scripts/debug-deploy.sh unlock | Fix Terraform state lock |
| ./.github/skills/debug-deploy/scripts/debug-deploy.sh rerun | Re-run most recent failed workflow |
| ./.github/skills/debug-deploy/scripts/debug-deploy.sh watch | Watch running workflow |
Debug Script
The debug-deploy.sh script automates this debugging process, including issue detection in failed logs.
Prevention
The workflows are configured with these safeguards:
- cancel-in-progress: true - Cancels in-progress runs when a new push arrives (in deploy.yml)
- -lock-timeout=120s - Waits for locks instead of failing immediately (in deploy.yml terraform job)
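Whether both safeguards are still present can be checked with a small grep-based script. This is a sketch under the assumption that the workflow lives at `.github/workflows/deploy.yml`. The source names only `deploy.yml`, so pass a different path if needed. `check_safeguards` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical check: report whether a workflow file contains both safeguards.
# Argument: path to the workflow file (assumed default shown below).
check_safeguards() {
  local file="${1:-.github/workflows/deploy.yml}" status=0
  if grep -qE 'cancel-in-progress:[[:space:]]*true' "$file"; then
    echo "ok: cancel-in-progress: true"
  else
    echo "missing: cancel-in-progress: true"
    status=1
  fi
  if grep -qF -- '-lock-timeout=120s' "$file"; then
    echo "ok: -lock-timeout=120s"
  else
    echo "missing: -lock-timeout=120s"
    status=1
  fi
  return "$status"
}
```

The function returns non-zero when either setting is missing, so it can gate a pre-merge check.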