name: kcov-debug description: Diagnose and fix ClaudeSec coverage failures — kcov bash coverage (scanner-shell-coverage job), pytest scanner/lib gate, hangs, "no coverage.json found", and floor-below-threshold errors. Use when a coverage CI job fails, hangs, reports 0%/missing coverage, or when raising a coverage floor. user-invocable: true
kcov / Coverage Debugging Playbook
Actionable consolidation of ClaudeSec's hard-won coverage knowledge (PRs #116–#193).
Use this when a coverage job fails, hangs, or you need to move a floor. Source of
truth is always the live .github/workflows/lint.yml; the numbers below were current
as of PR #193 (2026-06) — re-read the workflow before quoting a threshold.
Two independent coverage gates
| Gate | Job | Tool | Floor | Path |
|---|---|---|---|---|
| Python | scanner-unit-tests |
pytest + coverage | 99% (scanner/lib) |
test-reports/coverage.xml |
| Bash | scanner-shell-coverage |
kcov v42 | 90% | kcov-out/merged/ (JSON located via find, path unstable — see §2) |
These are separate. A red "coverage" check is one or the other — read the job name first.
Decision tree
Coverage job is failing/slow?
├── Job HANGS or times out (>30s per test, minutes total)
│ → MISSING OFFLINE GUARD. See §1. Most common kcov failure.
├── "No coverage.json found under kcov-out/"
│ → MERGE-PATH DISCOVERY. See §2.
├── Reports 0% or wildly low bash coverage
│ → KCOV INSTRUMENTATION (wrong target / include-pattern). See §3.
├── "Bash coverage N% is below required threshold 90%"
│ → REAL REGRESSION or floor too high. See §4.
└── Python pytest gate below 99%
→ scanner/lib SUT coverage. See §5.
§1 — Job hangs / per-test timeout (the #190 root cause)
Symptom: scanner-shell-coverage (or any test calling generate_html_dashboard)
runs for minutes; a test hits the timeout 30 cap (rc=124, ::warning::... TIMED OUT).
Root cause (confirmed in #190): un-gated live GitHub API calls in
dashboard-gen.py, NOT kcov ptrace/xtrace amplification. test_output_coverage.sh
alone calls generate_html_dashboard 24 times → 24× network round-trips. The earlier
"xtrace/ptrace overhead" hypothesis was wrong.
Fix: export CLAUDESEC_DASHBOARD_OFFLINE=1.
- CI:
scanner-shell-coveragesets it at the job level (lint.ymljobenv:);scanner-unit-testssets it on the pytest step only — its earlier bash steps don't callgenerate_html_dashboard, so they don't need it. - New tests: self-export it too, belt-and-suspenders.
- Local repro:
CLAUDESEC_DASHBOARD_OFFLINE=1 bash scanner/tests/test_output_coverage.shdrops from 120s+ → ~3.7s. Offline output is byte-identical for coverage purposes, so measured % is unchanged.
The timeout 30 cap (tightened 120→30 in #193) is a defensive backstop, not a
budget — a test that needs >30s under kcov is a bug, almost always this one.
§2 — "No coverage.json found" (kcov v42 merge path)
Symptom: job is green but prints WARN: ... coverage.json not found, or the gate
errors No coverage.json found under kcov-out/.
Root cause: kcov v42 --merge kcov-out/merged kcov-out/*/ writes the merged JSON
into a nested subdir (observed: kcov-merged/, also seen merged-kcov-output/),
NOT directly at kcov-out/merged/coverage.json. A hard-coded path test silently
missed it (#119→#120), and continue-on-error hid the failure.
Fix (already in lint.yml): discover with find, never a fixed path:
cov_json=$(find kcov-out/merged -name coverage.json -print -quit 2>/dev/null || true)
[ -z "$cov_json" ] && cov_json=$(find kcov-out -name coverage.json -print -quit 2>/dev/null || true)
If you touch the merge/summary/gate steps, keep all three using the SAME find-based discovery so they can't drift apart.
§3 — 0% / wrong bash coverage (instrumentation)
Two classic kcov v42 traps (#118):
- Trace the script, not the interpreter: use
kcov OUTDIR script.sh, NOTkcov OUTDIR bash script.sh(the latter instruments the bash binary, yielding 0%). - Include-pattern is substring matching on the literal sourced path. Tests source
via
$SCRIPT_DIR/../lib/checks.sh(a dynamic path), so canonical--include-path=scanner/libfails. Use filename-only:--include-pattern=checks.sh,output.sh.
Also: Ubuntu's apt kcov (v38 on Jammy) is broken for sourced files. The job builds
kcov v42 from source on ubuntu-latest (noble) and caches the binary.
§4 — Bash coverage below 90% floor
The floor is a one-way ratchet — it only moves up, and only with observed headroom.
Lineage: 50%(#123) → 65%(#124) → 85% → 90% (#171/#173/#174 lifted SUT 91.19%→92.36%).
Observed baseline drifts: as of 2026-06 a green main run measured 91.74%, i.e.
~1.7pp headroom over the 90% floor — re-read the run log, don't assume the peak.
- Real regression: a change removed exercised lib lines or added unexercised ones.
Add/extend a
scanner/tests/test_*.shfixture that drives the newlib/code, then re-measure. Do NOT lower the floor to pass. - Raising the floor: only after ≥2 consecutive clean
mainruns show headroom, and never above(observed baseline − ~2–4pp)buffer. Lift the measured baseline with new fixtures first; bump thethresholdconstant in the gate step second.
§5 — Python pytest gate below 99%
scanner-unit-tests runs pytest scanner/ with a 99% floor on scanner/lib (live
~99.12%, #165). Same discipline: add tests covering the uncovered lines (check
test-reports/coverage.xml / Codecov PR comment), don't relax the gate.
Local verification (always offline)
# Python gate
CLAUDESEC_DASHBOARD_OFFLINE=1 pytest scanner/ --cov=scanner/lib --cov-report=term-missing
# A single shell test under kcov (build/install kcov v42 first if needed)
CLAUDESEC_DASHBOARD_OFFLINE=1 kcov --include-pattern=checks.sh,output.sh \
kcov-out/one scanner/tests/test_checks_lib_direct.sh
# Merge + read the percentage
kcov --merge kcov-out/merged kcov-out/*/
cov_json=$(find kcov-out/merged -name coverage.json -print -quit)
python3 -c "import json,sys;print(json.load(open(sys.argv[1]))['percent_covered'])" "$cov_json"
Golden rules
- Offline first: if a coverage job is slow, assume a missing
CLAUDESEC_DASHBOARD_OFFLINE=1before anything else (§1). - Find, don't hard-code the merged
coverage.jsonpath (§2). - Fix coverage by adding tests, not by lowering floors (§4, §5).
- Re-read
lint.ymlfor current thresholds — numbers here drift; the workflow is canonical.
References
.github/workflows/lint.yml—scanner-unit-tests+scanner-shell-coveragejobs (canonical)docs/guides/ci-coverage-journey.md— full historical narrative (#103–#124)docs/reports/session-report-2026-06-01-pm.md— #190 kcov root-cause writeup- Project memory:
project_kcov_v42_path,project_kcov_test_output_coverage_slow,project_scanner_lib_coverage_floor,project_ci_offline_mode