ci-cd-troubleshooting

name: ci-cd-troubleshooting description: "Diagnoses and fixes CI/CD failures in GitHub Actions workflows. Use when CI is failing on a PR, builds are broken, or tests pass locally but fail in CI."

CI/CD Troubleshooting Workflow

When asked to diagnose or fix CI/CD failures (e.g., "Why is CI failing on PR #999?", "Fix the failing build"), follow this workflow to identify the root cause and optionally implement fixes.

How to use this skill
Related skills
Step 1: Identify the Failure
Step 2: Fetch Relevant Logs
Step 3: Diagnose Root Cause
Step 4: Reproduce Locally (if applicable)
Step 5: Report Findings or Fix
- Option A: Diagnostic Report ("Why is CI failing?")
- Option B: Implement Fix ("Fix the failing build")
Step 6: Verify Fix
Special Troubleshooting Considerations

How to use this skill

Attach this file to your Copilot Chat context, then invoke it with a failing PR number, branch, or CI run context. Use Option A for diagnosis-only requests and Option B when the user explicitly asks for a fix.

Related skills

Test Debugging — deep-dive test failure analysis and flakiness fixes
Development Workflow — full TDD execution when implementing CI fixes
PR Readiness Review — final validation before opening or updating a PR

Step 1: Identify the Failure

Get CI Status:
- For PRs: gh pr checks #999
- For branches: gh run list --branch <branch-name> --limit 5
- Note which jobs passed and which failed
Categorize the Failure Type:
- Test failures - Unit tests, integration tests failing
- Linter failures - Rubocop, YARD documentation issues
- Build failures - Dependency installation, compilation errors
- Timeout failures - Jobs exceeding time limits
- Platform-specific failures - Failing on specific Ruby version or OS
Identify Specific Failing Steps:
- Note the exact job name and step that failed
- Record the Ruby version, OS, and other environment details

Step 2: Fetch Relevant Logs

CRITICAL: CI logs can be massive (100K+ lines) and exceed token limits.

Get the Run ID:

gh run list --branch <branch> --limit 1 --json databaseId --jq '.[0].databaseId'

Fetch Failed Job Logs Only:
```
gh run view <run-id> --log-failed
```
This limits output to only failed jobs, making it manageable.
Extract Key Error Information:
- For test failures: Look for stack traces, assertion errors, specific test names
- For linter failures: Extract file names, line numbers, and violation types
- For build failures: Find dependency errors or missing packages
- Use grep to filter logs if still too large:
```
gh run view <run-id> --log-failed | grep -A 10 -B 5 "Error\|FAILED\|Failure"
```
Avoid Full Log Downloads:
- Do NOT use --log without --log-failed unless specifically requested
- If logs are still too large, focus on the most recent or critical failure

Step 3: Diagnose Root Cause

Based on the failure type, investigate:

For Test Failures:

Check if the test exists and what it's testing
Look for recent changes that might have broken the test
Consider environment differences (local vs. CI)
Check for flaky tests (intermittent failures)

For Linter Failures:

Run linters locally: bundle exec rubocop and bundle exec rake yard
Identify specific violations from the log
Check if violations are in files related to recent changes

For Build Failures:

Check dependency versions in Gemfile and git.gemspec
Look for platform-specific dependency issues
Verify Ruby version compatibility

For Timeout Failures:

Identify which test or step is timing out
Check for infinite loops or performance regressions
Consider if it's a resource limitation in CI environment

Step 4: Reproduce Locally (if applicable)

For PR Failures:

Fetch the PR branch:
```
gh pr checkout #999
```
Run the failing tests locally:
```
bundle exec bin/test <test-name>
```

Run linters:

bundle exec rubocop
bundle exec rake yard

For Branch Failures:

Checkout the branch.
Run full CI workflow:
```
bundle exec rake default
```

Step 5: Report Findings or Fix

Determine the appropriate action based on the user's request:

Option A: Diagnostic Report ("Why is CI failing?")

Present findings to the user:

# CI Failure Diagnosis: <Branch/PR>

**Status:** <X of Y jobs failed>

## Failed Jobs
1. **<Job Name>** (<Ruby version>, <OS>)
   - **Step:** <failing step name>
   - **Failure Type:** <test/linter/build/timeout>

## Root Cause
<Explanation of what's causing the failure>

## Error Details
```
<Relevant error messages and stack traces>
```

## Recommendations
- <Specific fix suggestion 1>
- <Specific fix suggestion 2>

**Would you like me to implement a fix, or do you need more information?**

STOP here unless the user asks you to proceed with fixes.

Option B: Implement Fix ("Fix the failing build")

Proceed based on failure type:

Test Failures: Use the full TDD workflow (Phase 1-3) to fix the failing tests
Linter Failures: Fix violations directly, commit with appropriate message (e.g., style: fix rubocop violations in lib/git/base.rb)
Build Failures: Update dependencies or configuration as needed
Timeout Failures: Investigate performance issues, may require user guidance

For PR Failures on Someone Else's PR:

You may not have push access to their branch
Present the fix and ask user to either:
- Push to the PR branch (if they have access)
- Comment on the PR with suggested changes
- Create a new PR with fixes

Step 6: Verify Fix

After implementing fixes:

Run Affected Tests Locally:
```
bundle exec bin/test <test-name>
```
Run Full CI Suite:
```
bundle exec rake default
```
Push and Monitor:
- Push the fixes
- Monitor CI to confirm the fix worked:
```
gh run watch
```

Confirm Resolution:

Fix implemented and pushed. Monitoring CI run...
CI Status: <link to run>

Special Troubleshooting Considerations

Platform-Specific Failures:

If tests pass on macOS but fail on Linux/Windows, document the difference
Check for path separator issues (/ vs. \)
Look for encoding differences
Consider file system case sensitivity

Flaky Tests:

If a test fails intermittently, note this in your diagnosis
Run the test multiple times locally to confirm flakiness
Suggest fixes for race conditions or timing issues

Permission Issues:

If you can't push to a PR branch, clearly communicate this limitation
Provide the exact commands or changes needed for the user to apply

Token Limit Management:

Always use --log-failed to limit output
If logs are still too large, use grep to extract errors
Focus on the first failure if multiple failures exist
Consider running tests locally instead of relying on full CI logs

name: ci-cd-troubleshooting description: "Diagnoses and fixes CI/CD failures in GitHub Actions workflows. Use when CI is failing on a PR, builds are broken, or tests pass locally but fail in CI."