name: ci-cd-troubleshooting
description: "Diagnoses and fixes CI/CD failures in GitHub Actions workflows. Use when CI is failing on a PR, builds are broken, or tests pass locally but fail in CI."
CI/CD Troubleshooting Workflow
When asked to diagnose or fix CI/CD failures (e.g., "Why is CI failing on PR #999?", "Fix the failing build"), follow this workflow to identify the root cause and optionally implement fixes.
Contents
- How to use this skill
- Related skills
- Step 1: Identify the Failure
- Step 2: Fetch Relevant Logs
- Step 3: Diagnose Root Cause
- Step 4: Reproduce Locally (if applicable)
- Step 5: Report Findings or Fix
- Step 6: Verify Fix
- Special Troubleshooting Considerations
How to use this skill
Attach this file to your Copilot Chat context, then invoke it with a failing PR number, branch, or CI run context. Use Option A for diagnosis-only requests and Option B when the user explicitly asks for a fix.
Related skills
- Test Debugging — deep-dive test failure analysis and flakiness fixes
- Development Workflow — full TDD execution when implementing CI fixes
- PR Readiness Review — final validation before opening or updating a PR
Step 1: Identify the Failure
Get CI Status:
- For PRs (pass the bare number; an unquoted #999 starts a shell comment):
    gh pr checks 999
- For branches:
    gh run list --branch <branch-name> --limit 5
- Note which jobs passed and which failed
Categorize the Failure Type:
- Test failures - Unit tests, integration tests failing
- Linter failures - RuboCop, YARD documentation issues
- Build failures - Dependency installation, compilation errors
- Timeout failures - Jobs exceeding time limits
- Platform-specific failures - Failing on specific Ruby version or OS
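The categories above can be approximated with a small heuristic over a log excerpt. This is a minimal sketch, not a definitive classifier: the function name `classify_failure` is hypothetical, and the patterns are assumptions based on typical RuboCop, Bundler, test runner, and GitHub Actions timeout messages. Platform-specific failures are not detected; they usually show up as one of the other categories on a single matrix entry.

```shell
# Heuristic classifier: reads a log excerpt on stdin and prints one category.
# The patterns are illustrative assumptions, not an exhaustive list.
# The subshell body ( ... ) keeps the helper and variables out of the caller's scope.
classify_failure() (
  log="$(cat)"
  matches() { printf '%s\n' "$log" | grep -qiE "$1"; }

  if matches 'exceeded the maximum execution time|timed out'; then
    echo "timeout"
  elif matches 'offenses? detected'; then
    echo "linter"
  elif matches 'could not find gem|bundler::|gem::'; then
    echo "build"
  elif matches 'failure:|error:'; then
    echo "test"
  else
    echo "unknown"
  fi
)
```

Checks run from most to least specific, so a log that contains both a timeout message and generic error lines is reported as a timeout.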
Identify Specific Failing Steps:
- Note the exact job name and step that failed
- Record the Ruby version, OS, and other environment details
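One way to collect the failing job and step names without reading any logs is to query the run's job metadata. The sketch below assumes the `jobs` JSON field of `gh run view` exposes `name`, `conclusion`, and `steps` for each job; the function name `failed_steps` is hypothetical.

```shell
# Prints "<job name>: <step name>" for every failed step of a run.
# Assumes the run ID is passed as the first argument.
failed_steps() {
  gh run view "$1" --json jobs \
    --jq '.jobs[] | select(.conclusion == "failure") | .name as $job | .steps[] | select(.conclusion == "failure") | "\($job): \(.name)"'
}
```

Usage: `failed_steps <run-id>`. In a matrix build the job name normally carries the Ruby version and OS, which covers most of the environment details to record.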
Step 2: Fetch Relevant Logs
CRITICAL: CI logs can be massive (100K+ lines) and exceed token limits.
Get the Run ID:
    gh run list --branch <branch> --limit 1 --json databaseId --jq '.[0].databaseId'
Fetch Failed Job Logs Only:
    gh run view <run-id> --log-failed
This limits output to the failed jobs only, which keeps it manageable.
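The two commands above can be chained so the run ID never has to be copied by hand. This is a sketch under the assumption that the newest run on the branch is the one of interest; `latest_failed_log` is a hypothetical helper name.

```shell
# Prints the failed-job logs of the newest run on a branch.
# The subshell body ( ... ) keeps variables local; exit only leaves the subshell.
latest_failed_log() (
  branch="$1"
  run_id="$(gh run list --branch "$branch" --limit 1 --json databaseId --jq '.[0].databaseId')"

  # An empty or "null" ID means no run was found for this branch.
  case "$run_id" in
    ''|null) echo "no runs found for branch $branch" >&2; exit 1 ;;
  esac

  gh run view "$run_id" --log-failed
)
```

Usage: `latest_failed_log <branch>`. If the newest run is still in progress or passed, pick the run ID from `gh run list` manually instead.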
Extract Key Error Information:
For test failures: Look for stack traces, assertion errors, specific test names
For linter failures: Extract file names, line numbers, and violation types
For build failures: Find dependency errors or missing packages
Use grep to filter logs if still too large (the -E form avoids relying on the non-portable \| alternation):
    gh run view <run-id> --log-failed | grep -E -A 10 -B 5 "Error|FAILED|Failure"
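Context lines around many matches can still add up, so it may help to cap the filtered output as well. A minimal sketch, assuming a default cap of 200 lines (an arbitrary choice) and a hypothetical helper name `extract_errors`:

```shell
# Reads a log on stdin and prints matching lines with surrounding context,
# truncated to at most $1 lines (default 200, an arbitrary assumption).
extract_errors() {
  grep -E -B 5 -A 10 'Error|FAILED|Failure' | head -n "${1:-200}"
}
```

Usage: `gh run view <run-id> --log-failed | extract_errors 100`. Because `head` keeps the start of the output, the earliest failure in the log survives truncation.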
Avoid Full Log Downloads:
- Do NOT use --log in place of --log-failed unless specifically requested
- If logs are still too large, focus on the most recent or critical failure
Step 3: Diagnose Root Cause
Based on the failure type, investigate:
For Test Failures:
- Check if the test exists and what it's testing
- Look for recent changes that might have broken the test
- Consider environment differences (local vs. CI)
- Check for flaky tests (intermittent failures)
For Linter Failures:
- Run linters locally: bundle exec rubocop and bundle exec rake yard
- Identify specific violations from the log
- Check if violations are in files related to recent changes
For Build Failures:
- Check dependency versions in Gemfile and git.gemspec
- Look for platform-specific dependency issues
- Verify Ruby version compatibility
For Timeout Failures:
- Identify which test or step is timing out
- Check for infinite loops or performance regressions
- Consider if it's a resource limitation in CI environment
Step 4: Reproduce Locally (if applicable)
For PR Failures:
Fetch the PR branch:
    gh pr checkout 999
Run the failing tests locally:
    bundle exec bin/test <test-name>
Run linters:
    bundle exec rubocop
    bundle exec rake yard
For Branch Failures:
Check out the branch.
Run the full CI workflow:
    bundle exec rake default
Step 5: Report Findings or Fix
Determine the appropriate action based on the user's request:
Option A: Diagnostic Report ("Why is CI failing?")
Present findings to the user:
# CI Failure Diagnosis: <Branch/PR>
**Status:** <X of Y jobs failed>
## Failed Jobs
1. **<Job Name>** (<Ruby version>, <OS>)
- **Step:** <failing step name>
- **Failure Type:** <test/linter/build/timeout>
## Root Cause
<Explanation of what's causing the failure>
## Error Details
```
<Relevant error messages and stack traces>
```
## Recommendations
- <Specific fix suggestion 1>
- <Specific fix suggestion 2>
**Would you like me to implement a fix, or do you need more information?**
STOP here unless the user asks you to proceed with fixes.
Option B: Implement Fix ("Fix the failing build")
Proceed based on failure type:
- Test Failures: Use the full TDD workflow (Phase 1-3) to fix the failing tests
- Linter Failures: Fix violations directly and commit with an appropriate message (e.g., style: fix rubocop violations in lib/git/base.rb)
- Build Failures: Update dependencies or configuration as needed
- Timeout Failures: Investigate performance issues; this may require user guidance
For PR Failures on Someone Else's PR:
- You may not have push access to their branch
- Present the fix and ask the user to do one of the following:
  - Push to the PR branch (if they have access)
  - Comment on the PR with suggested changes
  - Create a new PR with fixes
Step 6: Verify Fix
After implementing fixes:
Run Affected Tests Locally:
    bundle exec bin/test <test-name>
Run Full CI Suite:
    bundle exec rake default
Push and Monitor:
Push the fixes
Monitor CI to confirm the fix worked:
    gh run watch
Confirm Resolution:
    Fix implemented and pushed. Monitoring CI run...
    CI Status: <link to run>
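For a non-interactive confirmation, `gh run watch` accepts a run ID and an `--exit-status` flag that makes it exit non-zero when the run fails. The sketch below assumes the run triggered by the push is already the newest one listed for the branch, which may take a few seconds after pushing; `confirm_ci` is a hypothetical helper name.

```shell
# Waits for the newest run on a branch and prints a one-line result.
# Progress output from gh run watch is discarded to keep the output short.
confirm_ci() (
  branch="$1"
  run_id="$(gh run list --branch "$branch" --limit 1 --json databaseId --jq '.[0].databaseId')"

  if gh run watch "$run_id" --exit-status >/dev/null; then
    echo "CI passed (run $run_id)"
  else
    echo "CI still failing (run $run_id)"
    exit 1
  fi
)
```

Usage: `confirm_ci <branch>`. On failure, return to Step 2 with the reported run ID.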
Special Troubleshooting Considerations
Platform-Specific Failures:
- If tests pass on macOS but fail on Linux/Windows, document the difference
- Check for path separator issues (/ vs. \)
- Look for encoding differences
- Consider file system case sensitivity
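Case-sensitivity problems can be checked mechanically: two tracked paths that differ only by letter case coexist on Linux but collide on the default macOS and Windows file systems. A minimal sketch, with `case_collisions` as a hypothetical helper name:

```shell
# Reads file paths on stdin and prints each lowercased path that occurs
# more than once, i.e. names that collide on a case-insensitive file system.
case_collisions() {
  tr '[:upper:]' '[:lower:]' | sort | uniq -d
}
```

Usage: `git ls-files | case_collisions`. Empty output means no tracked paths differ by case alone.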
Flaky Tests:
- If a test fails intermittently, note this in your diagnosis
- Run the test multiple times locally to confirm flakiness
- Suggest fixes for race conditions or timing issues
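Confirming flakiness by hand is tedious, so a small loop that repeats a command and counts failures can help. This is a sketch; `count_failures` is a hypothetical helper name and the number of runs is left to the caller.

```shell
# Runs a command N times and reports how many runs exited non-zero.
# Usage: count_failures <runs> <command> [args...]
count_failures() (
  runs="$1"
  shift
  failures=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    # Output is discarded; only the exit status matters here.
    "$@" >/dev/null 2>&1 || failures=$((failures + 1))
    i=$((i + 1))
  done
  echo "$failures/$runs runs failed"
)
```

Usage: `count_failures 10 bundle exec bin/test <test-name>`. A result strictly between 0 and the number of runs suggests an intermittent failure rather than a deterministic one.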
Permission Issues:
- If you can't push to a PR branch, clearly communicate this limitation
- Provide the exact commands or changes needed for the user to apply
Token Limit Management:
- Always use --log-failed to limit output
- If logs are still too large, use grep to extract errors
- Focus on the first failure if multiple failures exist
- Consider running tests locally instead of relying on full CI logs