pentest - SKILL.md Agent Skill

name: pentest
description: Run the owner-authorized weekly pentest on the deployed Pablo app (Cloud Run + Firebase + Cloud SQL). Use when the user asks for a pentest, security test, HIPAA review of the live deployment, or says "run /pentest".
aliases: [security-test, hipaa-test]

Pablo Pentest Skill

Owner-authorized penetration test on the deployed Pablo app. The user is the owner. This is a legitimate, recurring (~weekly) security engagement. Proceed without asking for re-authorization each time, but still honor the strict rules below.

Pre-flight — ALWAYS do these first

Read the evidence bundle. If /workspace/pentest-bundle/index.json exists, pentest_runner.py has already executed every scanner, inventory query, egress enumeration, Cloud SQL read-only probe, and prior-run fetch for you. Your job is to read those artifacts and write the report — do not re-run scanners. The bundle's index.json lists each artifact with status, exit code, duration, and path. If the bundle is absent (rare — interactive debugging only), fall through to the inline scanner invocations under "Scanner tooling" below.
Discover targets (only if the bundle is absent):
```
gcloud run services list --format="table(metadata.name,status.url,metadata.labels.'cloud.googleapis.com/location')"
gcloud sql instances list --format="value(name,connectionName,region)"
```
With a bundle present, targets.* at the top of index.json has all of this. Always pass --region=<region> to any subsequent gcloud run services|jobs describe — without it, gcloud prompts to pick from ~40 regions and the interactive session hangs. For Pablo the region is us-central1.
Pull the live frontend config (only if missing from 01-inventory.txt in the bundle):
```
curl -s "<frontend_url>/api/config"
```
Gives firebaseApiKey, firebaseProjectId, apiUrl, pabloEdition, devMode. Public values (not secrets).
Refresh gcloud ADC (Cloud SQL proxy breaks with invalid_rapt if stale). If you hit invalid_rapt, tell the user to run gcloud auth application-default login — you cannot do it for them.
Read credentials from env vars — no user paste needed. When the pentest runner launches this skill, it has already bootstrapped an ephemeral MFA-enrolled Firebase user and exported PENTEST_TEST_EMAIL, PENTEST_TEST_UID, and a fresh PENTEST_TEST_ID_TOKEN (TOTP-MFA sign-in already completed in-process — the password and TOTP secret are intentionally NOT exported, to keep them off the env and child processes). Use $PENTEST_TEST_ID_TOKEN directly as the Bearer token for authenticated probes. The token is good for ~1h; pentests are shorter. The account is allowlisted via an Alembic-seeded row (pentest-auto@pablo-pentest.invalid, RFC 2606 reserved TLD); it is not a real customer and carries no practice assignment, so §6 (cross-tenant IDOR) still needs Cloud SQL-sourced foreign UUIDs. If the env vars are absent (interactive debugging outside the runner), fall back to owner-pasted credentials.
Scope note — this report only claims what it can technically observe. Paper controls (BAAs, signed policies, workforce training records, contingency plan documents) are out of scope for automated scanning. The companion document docs/compliance/hipaa-security-rule.md (per-control narrative) is the source of truth for those; this report's §7 control matrix marks them N/S (paper control — see narrative) and does not assert their status. Egress destinations and deployed service config are technically observable and stay in scope.
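As a minimal sketch of the first pre-flight step, the snippet below loads the bundle index and picks out artifacts that did not finish cleanly. The key names `artifacts`, `status`, `exit_code`, and `name`, the `"ok"` status value, and the sample artifact file names are assumptions made for illustration; the skill only states that index.json records status, exit code, duration, and path per artifact, so adjust to the real schema.

```python
import json
from pathlib import Path


def load_index(bundle_dir="/workspace/pentest-bundle"):
    """Return the parsed index.json, or None when the bundle is absent."""
    index_path = Path(bundle_dir) / "index.json"
    if not index_path.is_file():
        return None
    return json.loads(index_path.read_text())


def failed_artifacts(index):
    """Artifacts that did not finish cleanly and deserve a note in the report."""
    # "artifacts", "status" and "exit_code" are assumed key names, not the real schema.
    return [
        artifact
        for artifact in index.get("artifacts", [])
        if artifact.get("status") != "ok" or artifact.get("exit_code", 0) != 0
    ]


# Hypothetical in-memory index used only to exercise the helper.
sample_index = {
    "targets": {"region": "us-central1"},
    "artifacts": [
        {"name": "nuclei", "status": "ok", "exit_code": 0, "path": "02-nuclei.txt"},
        {"name": "ffuf", "status": "error", "exit_code": 2, "path": "03-ffuf.json"},
    ],
}
print([a["name"] for a in failed_artifacts(sample_index)])
```

When `load_index()` returns None, that corresponds to the "bundle absent" branch, where the inline scanner invocations apply instead.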

Strict rules (don't violate)

Rate limits are scoped by target class:
- Own Cloud Run services (*.run.app frontend/backend): ≤20 req/sec steady, short bursts up to 50 req/sec OK for scanner sweeps. Total run budget ≤30,000 requests across all tools.
- Firebase Auth (identitytoolkit.googleapis.com): strict — ≤1 req/sec, ≤10 bad-password attempts total across the whole run. Firebase anti-abuse locks accounts fast.
- Third-party infra (Google infra outside your Cloud Run, Firebase internals, Anthropic, dependencies): zero — do not probe.
No DoS. A sustained burst that trips Cloud Armor or drives Cloud Run autoscaling into new instances is itself a finding — stop and report, do not keep pushing. Never use scanner flags like -t 100 / --threads 10 / -rate 200.
Read-only against Cloud SQL directly. No UPDATE / DELETE / INSERT statements through the psql / cloud-sql-proxy path. The DB-side assertion is: if the pentest touched the DB, it did so only via SELECT.
Writes through the authenticated API are allowed — with cleanup. You may create test patients and therapy sessions via the normal authenticated POST endpoints in order to exercise CRUD paths. Constraints:
- Prefix the first name with PENTEST- and the last name with a run UUID (e.g. PENTEST-ab12cd34) so cleanup is deterministic.
- Never upload real transcript content or real PHI. Synthetic placeholder strings only (e.g. "synthetic test transcript for pentest run ab12cd34").
- At the end of the run, delete every patient/session/appointment you created via the authenticated DELETE endpoints. Final step of every run must be a cleanup pass.
- If cleanup fails partway, report the un-deleted IDs in the findings so the owner can sweep manually.
Test users: if you create test users (identitytoolkit:signUp), delete them at the end via identitytoolkit /accounts:delete. Prefer pre-provisioned test accounts passed in by the owner — only mint new ones when the test explicitly needs a fresh identity.
Exploit to PoC only — one clear reproduction, then stop. No further pivoting.
Stay on *.run.app services belonging to this app and its Cloud SQL instance. Do not probe Google / Firebase infra itself, Anthropic, or other third parties.
No stored XSS payloads. Reflected-only, in your own session.
Redact any PHI/PII in the report to <REDACTED>.
Stop on lockout / WAF / rate-limit and report what you have.
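The rate-limit rules above can be approximated with a small pacing helper. This is a sketch, not the runner's implementation: the `RequestBudget` class and its method names are hypothetical, and it enforces only the steady rate and the total run budget (it does not model the short 50 req/sec burst allowance).

```python
import time


class RequestBudget:
    """Pace requests to a steady rate and stop when the run budget is spent."""

    def __init__(self, per_second=20, total=30_000):
        self.min_interval = 1.0 / per_second
        self.remaining = total
        self.next_allowed = 0.0

    def acquire(self, now):
        """Return how many seconds to wait before sending the next request."""
        if self.remaining <= 0:
            raise RuntimeError("run request budget exhausted; stop and report")
        self.remaining -= 1
        wait = max(0.0, self.next_allowed - now)
        # Schedule the following request one interval after this one.
        self.next_allowed = max(now, self.next_allowed) + self.min_interval
        return wait


# Defaults mirror the Cloud Run limits: 20 req/sec steady, 30,000 requests per run.
cloud_run_budget = RequestBudget()
time.sleep(cloud_run_budget.acquire(time.monotonic()))
```

A second instance such as `RequestBudget(per_second=1, total=10)` could track bad-password attempts against Firebase Auth under the same assumptions.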

Scanner tooling (pre-installed — use sanctioned invocations)

The pentest container ships with rate-limited scanners. The flags below are mandatory — don't drop them.

nuclei — template-driven vuln/misconfig scan against the backend:

nuclei -u "<backend_url>" \
  -t cves/ -t exposures/ -t misconfiguration/ -t http/default-logins/ \
  -rate-limit 20 -c 10 \
  -severity medium,high,critical \
  -exclude-tags dos,fuzz,intrusive \
  -o /tmp/nuclei.txt

ffuf — endpoint discovery:

ffuf -u "<backend_url>/api/FUZZ" \
  -w /usr/share/seclists/Discovery/Web-Content/common.txt \
  -rate 20 -t 5 -mc 200,201,301,302,401,403 \
  -of json -o /tmp/ffuf.json

sqlmap — only on a specific parameter you already suspect. Never blanket-run:

sqlmap -u "<backend_url>/api/endpoint?param=1" \
  --headers="Authorization: Bearer $ID_TOKEN" \
  --delay 0.5 --threads 1 --level 1 --risk 1 \
  --technique=BT --batch --flush-session --output-dir=/tmp/sqlmap

nikto — narrow tunings:

nikto -h "<backend_url>" -Tuning 2,3,4,5 -maxtime 5m -o /tmp/nikto.txt

testssl.sh — TLS posture:

testssl.sh --severity MEDIUM --color 0 "<backend_url>" > /tmp/testssl.txt

semgrep — static analysis over the baked-in backend source (/app/backend). Feed results into the Static analysis findings section of the report; review each hit manually before elevating.

semgrep scan \
  --config=p/owasp-top-ten --config=p/python --config=p/security-audit \
  --severity=ERROR --severity=WARNING \
  --json --output=/tmp/semgrep.json --metrics=off --quiet \
  /app/backend
jq '.results | group_by(.check_id) | map({rule: .[0].check_id, count: length, paths: [.[].path] | unique})' /tmp/semgrep.json

Playwright + Chromium — JS-console capture, DOM XSS PoC, CSP verification. The global node_modules is symlinked at /node_modules, so any .mjs you drop in /tmp or /workspace can import 'playwright' with no extra setup (ESM ignores NODE_PATH; it walks up from the script looking for node_modules/ and hits /node_modules).

// /tmp/dom_probe.mjs — run: node /tmp/dom_probe.mjs
import { chromium } from 'playwright';
const browser = await chromium.launch({ headless: true });
const page = await browser.newPage();
page.on('console', msg => console.log(`CONSOLE ${msg.type()}: ${msg.text()}`));
page.on('pageerror', err => console.log(`PAGEERROR: ${err.message}`));
await page.goto(process.env.TARGET_URL);
await page.waitForLoadState('networkidle');
const resp = await page.request.get(process.env.TARGET_URL);
console.log('CSP:', resp.headers()['content-security-policy']);
console.log('HSTS:', resp.headers()['strict-transport-security']);
await browser.close();

Auth flow — Firebase MFA sign-in (bash)

Firebase MFA sign-in with TOTP takes three steps: generate a fresh TOTP code, call signInWithPassword, then call mfaSignIn:finalize. Use this pattern; the TOTP window matters (Firebase rejects replays).

API_KEY="<from /api/config firebaseApiKey>"
SECRET="<TOTP b32 secret for this account>"

# Wait until the start of a fresh TOTP window to avoid a wasted attempt
until [ $((30 - $(date +%s) % 30)) -lt 3 ]; do sleep 2; done
sleep 3
CODE=$(oathtool --totp -b "$SECRET")

resp=$(curl -s -X POST "https://identitytoolkit.googleapis.com/v1/accounts:signInWithPassword?key=$API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"email":"<email>","password":"<pw>","returnSecureToken":true}')
MFA_CRED=$(echo "$resp" | python3 -c "import json,sys; print(json.load(sys.stdin)['mfaPendingCredential'])")
ENROLLMENT_ID=$(echo "$resp" | python3 -c "import json,sys; print(json.load(sys.stdin)['mfaInfo'][0]['mfaEnrollmentId'])")

# Finalize — MUST include both mfaPendingCredential AND mfaEnrollmentId
curl -s -X POST "https://identitytoolkit.googleapis.com/v2/accounts/mfaSignIn:finalize?key=$API_KEY" \
  -H "Content-Type: application/json" \
  -d "{\"mfaPendingCredential\":\"$MFA_CRED\",\"mfaEnrollmentId\":\"$ENROLLMENT_ID\",\"totpVerificationInfo\":{\"verificationCode\":\"$CODE\"}}" \
  > /tmp/signin.json

ID_TOKEN=$(python3 -c "import json; print(json.load(open('/tmp/signin.json'))['idToken'])")

Gotchas learned the hard way:

mfaSignIn:finalize requires BOTH mfaPendingCredential and mfaEnrollmentId. Omitting mfaEnrollmentId returns a confusing INVALID_ARGUMENT.
A consumed TOTP code → INVALID_CODE. Wait for the next 30-second window before retrying.
The Bash tool blocks long leading sleeps (for example, sleep 32 in pre-flight). Use the until [ ... ]; do sleep 2; done loop pattern to wait for the next window.
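For debugging the TOTP timing outside bash, the window arithmetic and code generation can be sketched with the Python standard library. This assumes the RFC 6238 defaults that `oathtool --totp -b` uses (HMAC-SHA1, 30-second step, 6 digits); the function names are illustrative and are not part of the runner.

```python
import base64
import hashlib
import hmac
import struct


def totp(secret_b32, now, step=30, digits=6):
    """RFC 6238 TOTP for a base32 secret at Unix time `now`."""
    cleaned = secret_b32.replace(" ", "").upper()
    cleaned += "=" * (-len(cleaned) % 8)  # restore stripped base32 padding
    key = base64.b32decode(cleaned)
    counter = struct.pack(">Q", int(now) // step)
    digest = hmac.new(key, counter, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation (RFC 4226)
    value = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


def seconds_left_in_window(now, step=30):
    """Seconds until the current TOTP window rolls over (1..step)."""
    return step - int(now) % step
```

Generating the code only when `seconds_left_in_window()` is close to the full step mirrors the until-loop above: it avoids submitting a code that expires mid-request.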

Secrets lookup (GCP)

Secret Manager in the Pablo GCP project is authoritative. Names observed: pablo-database-url, pablo-db-password, AUTH_SECRET, JWT_SECRET_KEY, GOOGLE_CLIENT_ID/SECRET, AUTH_COOKIE_SIGNATURE_KEY. Project id is whatever is in /api/config.firebaseProjectId (e.g., pablohealth-oss).

gcloud secrets list --project=<project>
gcloud secrets versions access latest --secret=pablo-database-url --project=<project>
# parse: postgresql://pablo:<pw>@/pablo?host=/cloudsql/<conn>
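Parsing the connection URL by hand is error-prone, so here is a small sketch using `urllib.parse`. It assumes the secret has the shape shown in the comment above (user and password in the authority, Unix-socket directory in the `host` query parameter); the helper names and the sample connection name are hypothetical. The redaction helper exists so the password never lands in logs or the report.

```python
from urllib.parse import parse_qs, urlsplit


def parse_database_url(url):
    """Split a postgresql:// URL into the pieces psql needs."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    return {
        "user": parts.username,
        "password": parts.password,  # raw value; percent-decoding is left to the caller
        "dbname": parts.path.lstrip("/"),
        "socket_dir": query.get("host", [None])[0],
    }


def redacted(url):
    """Same URL with the password replaced, safe to echo into the report."""
    password = urlsplit(url).password
    if password is None:
        return url
    return url.replace(":" + password + "@", ":<REDACTED>@", 1)


# Hypothetical value with the same shape as the pablo-database-url secret.
example_url = "postgresql://pablo:s3cret@/pablo?host=/cloudsql/proj:us-central1:inst"
print(redacted(example_url))
```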

Cloud SQL direct access

# Use a high port — the user may have another cloud-sql-proxy running on 5433
cloud-sql-proxy --port 15433 <connection> > /tmp/p.log 2>&1 &
# wait until "ready for new connections" appears in /tmp/p.log
DB_PW=$(gcloud secrets versions access latest --secret=pablo-db-password --project=<project>)
PGPASSWORD="$DB_PW" psql "host=127.0.0.1 port=15433 user=pablo dbname=pablo sslmode=disable" -c "\dt"

Always run read-only at the DB level. SELECT ... LIMIT 5. Queries to prioritize:

\dt — confirm audit_logs exists and has recent rows (SELECT count(*) FROM audit_logs WHERE timestamp > now() - interval '1 day';). A long gap in audit activity is itself a HIGH finding.
Confirm audit schema is PHI-free: \d+ audit_logs should NOT show user_email, user_name, patient_name columns. If it does, that's a regression.
RLS policies: SELECT schemaname, tablename, policyname FROM pg_policies;
SHOW row_security; and per-connection SHOW app.current_user_id;
Scan for unexpectedly plaintext PHI columns (e.g., \d+ patients → SSN, DOB storage).

Kill the proxy with pkill -f "cloud-sql-proxy --port 15433" when done.
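To back the read-only rule with a mechanical check, a statement can be screened before it is handed to psql. The sketch below is a conservative textual filter, not a SQL parser: `is_read_only` is a hypothetical helper, and it will reject some harmless statements (for example a semicolon inside a string literal) rather than risk letting a write through.

```python
import re

# SELECT and SHOW statements plus psql meta-commands such as \dt and \d+.
ALLOWED_PREFIXES = ("select", "show", "\\d")
WRITE_KEYWORDS = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|create|grant|copy|into)\b"
)


def is_read_only(statement):
    """True only for a single statement that looks like a pure read."""
    text = statement.strip().rstrip(";").strip().lower()
    if ";" in text:  # refuse stacked statements outright
        return False
    if not text.startswith(ALLOWED_PREFIXES):
        return False
    return WRITE_KEYWORDS.search(text) is None


for sql in ("SELECT schemaname, tablename, policyname FROM pg_policies;",
            "DELETE FROM patients"):
    print(sql, "->", is_read_only(sql))
```

A textual filter is a second line of defense at best; setting PostgreSQL's `default_transaction_read_only` for the session is a stronger guard where the connection allows it.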

Vulnerability exception log (read every run)

docs/pentest/VULNERABILITY_EXCEPTIONS.md is the operator's documented-risk register for advisories that aren't patched within the standard SLA — required by § 164.308(a)(1)(ii)(B). Treat it as first-class evidence: it both suppresses known-and-accepted dep-scan hits in §6 and is itself a thing that can rot.

Find the file. Path varies by how the runner mounted the repo — try in order, stop at the first hit:

EXC_FILE=""
for p in /workspace/pentest-bundle/repo/docs/pentest/VULNERABILITY_EXCEPTIONS.md \
         /workspace/repo/docs/pentest/VULNERABILITY_EXCEPTIONS.md \
         /app/docs/pentest/VULNERABILITY_EXCEPTIONS.md \
         docs/pentest/VULNERABILITY_EXCEPTIONS.md; do
  [ -f "$p" ] && EXC_FILE="$p" && break
done

If no path resolves, that's a MEDIUM finding in §6 — the operator cannot demonstrate documented risk decisions for unpatched advisories. Do NOT silently treat absence as "no exceptions."

Per open entry (under ## Open), parse ### <advisory-id>, Severity, Status, Revisit by, and the entry's age. Age sources, in order:

An explicit Raised: field, if present (preferred — push the operator to add one when missing).

Git first-add date for the entry heading (fallback):

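REPO_ROOT="$(git -C "$(dirname "$EXC_FILE")" rev-parse --show-toplevel)"
# Pickaxe (-S) lists commits that changed how often the heading occurs; the oldest one is the first add.
# Do not combine with --diff-filter=A: that would only match if the entry was present when the file was created.
git -C "$REPO_ROOT" log --format=%aI -S"### CVE-2026-3219" -- "$EXC_FILE" | tail -1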

The file's first-add date as the conservative upper bound if neither works.

Compute age_days = today - raised. Today is date -u +%Y-%m-%d.

Cross-reference into §6. For any §9 dep-scan hit (pip-audit, trivy, osv-scanner) whose advisory ID is listed under ## Open, do not promote it on dep-scan severity alone — record it as INFO with Status: documented-exception (see §9 Documented exceptions → <advisory-id>). That's the entire purpose of the log.

Staleness rule — HIGH finding when triggered. Any open entry with age_days > 30 ships in §6:

ID: PABLO-EXC-STALE-<advisory-id>
Title: Vulnerability exception stale (>30 days) — refresh or remediate: <advisory-id>
Severity: HIGH (§ 164.308(a)(1)(ii)(B) — documented risk decisions must be re-reviewed; >30 days without update means the decision is unverified against current upstream state)
Description: Exception was raised <age_days> days ago. Either upstream now ships a fix (close the entry), the situation has changed (refresh Why not patched, Compensating control, and Revisit by with current rationale), or the risk acceptance is silently aging — none acceptable to OCR.
Remediation: Patch the dep if a fix exists, OR update the entry with current rationale and a new Revisit by. Bumping Revisit by alone without re-justifying does not clear this finding — the staleness check is on Raised:/git-add date, not on Revisit by. To reset the clock, add or update a Raised: field to today and document what was re-verified.
Owner: copy the entry's Owner (default Kurt Niemi).

If age_days ≤ 30 for every open entry → §6a positive control row: "Exception log fresh — N open, oldest <age_days> days, all within 30-day refresh window."
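The age computation and staleness rule reduce to a date subtraction and a threshold. A minimal sketch, assuming the open entries have already been parsed into a mapping of advisory ID to raised date; the function names and the second advisory ID are hypothetical.

```python
from datetime import date

STALE_AFTER_DAYS = 30


def age_days(raised, today):
    """age_days = today - raised, in whole days."""
    return (today - raised).days


def stale_findings(open_entries, today):
    """One HIGH finding per open entry older than the 30-day refresh window."""
    findings = []
    for advisory_id, raised in open_entries.items():
        age = age_days(raised, today)
        if age > STALE_AFTER_DAYS:
            findings.append({
                "id": f"PABLO-EXC-STALE-{advisory_id}",
                "severity": "HIGH",
                "age_days": age,
            })
    return findings


# Hypothetical parsed entries: advisory ID -> Raised date.
entries = {
    "CVE-2026-3219": date(2026, 1, 1),
    "GHSA-0000-0000-0000": date(2026, 2, 1),
}
print(stale_findings(entries, today=date(2026, 2, 15)))
```

An entry that is exactly 30 days old stays fresh, matching the strict `> 30` comparison in the rule.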

Checklist (run in order — keep it updated as findings stabilize)

1. Recon — TLS (openssl s_client -tls1_1 should fail with alert 70), cert chain, security headers on frontend + backend, /api/config, /api/health, /docs and /openapi.json (404 in prod — good), /api/ext/auth/seed-admin (404 expected — referenced but not implemented), robots.txt. Run testssl.sh for full TLS config coverage.
1b. Scripted sweep — run nuclei and ffuf against the backend URL (sanctioned invocations above). Review findings manually; fold HIGH/CRITICAL into the report. Skip nikto / sqlmap unless the sweep surfaces a lead.
1c. Static analysis + dependency scan — semgrep (code), pip-audit (Python deps), trivy image against the deployed backend tag, osv-scanner --recursive /app/backend, gitleaks detect --source /app/backend --no-git. Feeds §8 and §9 of the report. Any gitleaks hit = automatic CRITICAL finding in §6.
1d. Vulnerability exception log review — locate docs/pentest/VULNERABILITY_EXCEPTIONS.md (see "Vulnerability exception log" section above), parse every entry under ## Open, compute age_days per entry, cross-reference advisory IDs against the §1c dep-scan output, and apply the staleness rule (>30 days → HIGH §6 finding). Missing file → MEDIUM finding.
2. Cloud Run IAM — gcloud run services get-iam-policy <svc>. Both backend and frontend are allUsers → roles/run.invoker by design; auth is at app layer. Verify with unauth curl → expect 401 on /api/{users/me, patients, sessions, admin/users} and 403 "Service auth required" on /api/ext/auth/*.
3. CORS — preflight with Origin: https://evil.com must not be reflected. Legit origin = the frontend's Cloud Run URL; check allow_credentials: true is only paired with that.
4. Auth flow — decode JWT (alg=RS256, aud=<projectId>, firebase.sign_in_second_factor=totp). Tamper to alg=none → 401. Swap X-Tenant-ID header → ignored (tenant from JWT).
5. Unauth surface — /api/ext/auth/check-allowlist and check-status require Bearer. Recurring HIGH check: ext_auth.py:_verify_blocking_function_token must pass audience=<backend URL> to google.oauth2.id_token.verify_token. If the audience= kwarg is still missing, file it again.
6. Cross-tenant IDOR — do this autonomously, do NOT ask the user to create accounts. The runner has already provisioned two MFA-enrolled ephemeral users for you on this run: tokens in $PENTEST_TEST_ID_TOKEN_A / $PENTEST_TEST_ID_TOKEN_B (emails in $PENTEST_TEST_EMAIL_A / _B, uids in $PENTEST_TEST_UID_A / _B). Exercise both:
- With $PENTEST_TEST_ID_TOKEN_A, POST one patient + one session through the authenticated endpoints (name prefix PENTEST-<run-uuid>) so there is a known-real A-owned record. Capture the returned UUIDs.
- With $PENTEST_TEST_ID_TOKEN_B, GET each of: /api/patients/<A's-patient-uuid>, /api/sessions/<A's-session-uuid>, /api/soap-notes/<A's-soap-uuid> (if created), /api/appointments/<A's-appt-uuid> (if created). Expect 404 from the repo layer (NOT 403 — 403 leaks existence). A 200 is a cross-tenant BOLA = CRITICAL + Breach candidate § 164.402. A 403 is HIGH (info leak via presence signal).
- Repeat write-side probes: PUT/DELETE on the same foreign UUIDs with $PENTEST_TEST_ID_TOKEN_B → expect 404. A 200/204 means a write IDOR — CRITICAL + Breach candidate.
- Probe X-Tenant-ID header swap on the same endpoints — the header must be ignored (tenant is taken from JWT).
- Horizontal BOLA within a shared tenant (A and B land in the same tenant in single-tenant deployments): same probe set exercises clinician-scoped isolation. Record which scenario you're running (cross-tenant vs same-tenant horizontal) in §6.
Fallback — if both tokens are missing (identity bootstrap failed): note explicitly in §12 that "2-account IDOR was not exercised this run — identity bootstrap returned no credentials". Do NOT fabricate results and do NOT attempt to mint new Firebase accounts inline — the runner owns that lifecycle.
7. Privilege escalation — /api/admin/* with clinician token → ADMIN_REQUIRED. JWT tamper (won't work, Firebase verifies sig).
8. Injection (targeted) — 1–2 probes each on reflected XSS, SQLi (boolean+time on obvious params), SSRF on iCal feed_url (hostname allowlist + scheme check; don't bother with 169.254.169.254 — it's blocked by hostname).
9. Rate limiting — 5 bad passwords on the real account via Firebase signInWithPassword (Firebase handles lockout; don't exceed 10). Skip if prior run already tripped lockout.
10. Upload DoS — Content-Length: 2000000000 + tiny body header-only probe (reject on header = good). Recurring MEDIUM check: sessions.py:upload_audio should read in bounded chunks, not await file.read() before size check.
11. MFA integrity — POST /api/users/me/mfa-enrolled as a fresh non-MFA sign-up. Recurring MEDIUM: returns a new mfa_enrolled_at without verifying Firebase-side enrollment. Doesn't grant access (JWT claim still gates), but poisons compliance metrics.
12. Signup hygiene — identitytoolkit /accounts:signUp with @example.invalid should fail in a locked-down deployment. If it succeeds, restrict_signups=false — flag it.
13. Cloud SQL direct — see section above. audit_logs table existence is the key HIPAA check.
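The expected outcomes of the cross-tenant IDOR probes in the checklist map directly from HTTP status code to severity. A minimal sketch of that mapping follows; `classify_idor_probe` is a hypothetical helper, and routing any other status code to manual review is an assumption rather than something the checklist specifies.

```python
def classify_idor_probe(status_code):
    """Map the response to a foreign-UUID probe onto the checklist's verdicts."""
    if status_code == 404:
        # Expected: the repo layer hides the record's existence.
        return {"verdict": "PASS", "severity": None, "breach_candidate": False}
    if status_code == 403:
        # Presence signal: the caller learns the record exists.
        return {"verdict": "FAIL", "severity": "HIGH", "breach_candidate": False}
    if status_code in (200, 204):
        # Foreign record was read or modified.
        return {"verdict": "FAIL", "severity": "CRITICAL", "breach_candidate": True}
    # Assumption: anything else (401, 5xx, ...) needs a human look.
    return {"verdict": "REVIEW", "severity": None, "breach_candidate": False}


for code in (404, 403, 200):
    print(code, classify_idor_probe(code))
```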

Deliverable — HIPAA-grade report format

Target audience: HHS OCR auditors, Pablo's owner/operator, and an external qualified pentester using this report as their scoping input. Every run produces the full artifact — there is no "lite" mode.

Never skip a finding. Every issue observed — INFO through CRITICAL — lands in §6 with a full write-up. Severity is how findings are ordered and highlighted, not a gate for inclusion. A "clean" run still enumerates the INFO-level observations it considered. §2 highlights CRITICAL/HIGH for the reader; nothing is dropped because it looked minor. If you considered an issue and decided it was a non-issue, that belongs in §6a (Positive controls and items tested clean) with the reasoning, not silently omitted.

Positive findings are first-class. §6a is mandatory — it captures the controls that passed and the things the assessor considered and dismissed, with the same evidence rigor as a finding (what was tested, what the expected behavior was, what was observed). This is what a Covered Entity shows an auditor to demonstrate that the technical safeguards are actually working, not just written down, and that the assessor looked at the full surface.

Print this disclaimer on every report cover: this is an automated self-assessment driven by an LLM, not an independent qualified third-party pentest. For full HIPAA §164.308(a)(8) defensibility (2024 NPRM anticipated to finalize in 2026), Pablo should still engage an independent qualified pentester at least annually. This weekly artifact complements, and does not replace, that engagement; it is meant to surface issues between formal engagements and to give the external tester a scoped starting point.

Report sections, in order (use these exact headings):

1. Cover & scope

Covered entity (Pablo Health, LLC); report ID (PABLO-PENTEST-<date>-<run UUID>); reporting period (since prior run in GCS); tester identity (CLI + model); authorization statement; in-scope systems (Cloud Run × 2, Cloud SQL, Firebase, GCS compliance bucket, Secret Manager, backend source); out-of-scope (GCP/Firebase/Vertex infra, third-party source repos, physical, social engineering, wireless); run metadata (project ID, URLs, connection name, region).

2. Executive summary

3–5 bullets, business-risk language. Lead with severity totals and trend vs prior run.

3. Asset inventory & data flow

Observed external egress destinations — enumerate from code + config, then reason about each destination's BAA coverage. List:

Deployed service env vars that route inference: gcloud run services describe pablo-backend --region=us-central1 --format='value(spec.template.spec.containers[0].env)' → record CLAUDE_CODE_USE_VERTEX, GOOGLE_GENAI_USE_VERTEXAI, ANTHROPIC_VERTEX_PROJECT_ID, GOOGLE_CLOUD_LOCATION values.

All external hostnames the backend can reach:

grep -rEoh "https?://[a-zA-Z0-9.-]+" /app/backend | sort -u

For each destination the backend can send request bodies (prompts, transcripts, patient fields) to, determine whether it is covered by a Business Associate Agreement — Vertex AI under the project's Google Cloud BAA is covered; api.anthropic.com, api.openai.com, generativelanguage.googleapis.com (public Gemini direct) are not covered by the Google Cloud BAA. Apply the severity rubric below: ePHI traversing a non-BAA destination is a § 164.504(e) permitted-use violation and a § 164.402 Breach candidate — score per the rubric, do not pre-commit to a specific tier here.
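The hostname enumeration and the first-pass BAA check can be sketched together. The regular expression mirrors the grep pattern above, and the non-BAA set contains only the three hosts this section names; every other hostname is marked for manual review rather than assumed covered. The function names and the sample source text are illustrative.

```python
import re

# Hosts this section names as NOT covered by the Google Cloud BAA.
NON_BAA_HOSTS = {
    "api.anthropic.com",
    "api.openai.com",
    "generativelanguage.googleapis.com",
}
URL_HOST = re.compile(r"https?://([a-zA-Z0-9.-]+)")


def extract_hosts(source_text):
    """Unique hostnames referenced by http(s) URLs in the given text."""
    return sorted({host.lower().rstrip(".") for host in URL_HOST.findall(source_text)})


def flag_egress(hosts):
    """Label each host 'non-BAA' or 'review'; nothing is auto-approved."""
    return {host: ("non-BAA" if host in NON_BAA_HOSTS else "review") for host in hosts}


# Hypothetical source snippet used to exercise the helpers.
sample_source = 'BASE = "https://api.anthropic.com/v1"\nDOCS = "http://example.com."\n'
print(flag_egress(extract_hosts(sample_source)))
```

A "review" label still requires the reasoning step described above: whether request bodies with ePHI can actually reach that host, and under which agreement.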

4. Threat model (STRIDE-lite, Pablo-specific)

Actors: unauth internet; authenticated clinician cross-tenant; authenticated clinician horizontal BOLA within tenant; insider with GCP IAM; compromised dependency; compromised subprocessor. Crown jewels: patient records, session transcripts, audit_logs, Firebase/GCP credentials, AUTH_SECRET/JWT_SECRET_KEY. Map each actor to attack paths and to the findings in §6.

5. Methodology & frameworks

OWASP WSTG v4.2, OWASP API Security Top 10 (2023), OWASP ASVS 4.0 Level 2, PTES Technical Guidelines, NIST SP 800-115, HIPAA Security Rule §164.308/.310/.312/.314 (2024 NPRM). List each tool from the scanner tooling section that was invoked, along with the flags used.

6. Findings

Severity rubric — HIPAA overlay applies. Raw CVSS underweights PHI impact: a "medium" CVSS CVE that enables PHI access is a § 164.402 Breach and ships as CRITICAL. Use:

effective_severity = max(cvss_tier, phi_impact_tier)

where phi_impact_tier is:

CRITICAL — any unauthenticated PHI read; cross-tenant or horizontal PHI access by an authenticated user; PHI integrity compromise (write/delete across tenant boundary); PHI egress to infrastructure not covered by a Business Associate Agreement (§ 164.504(e) — the disclosure path itself is an impermissible use regardless of whether a Breach has occurred yet); any condition that would meet the § 164.402 "acquisition, access, use, or disclosure of PHI in a manner not permitted" definition of a reportable Breach.
HIGH — authenticated privilege escalation; auth-bypass that could lead to PHI without a second bug; PHI availability loss > 24h; any secret in git (gitleaks hit); missing/non-firing audit logs on PHI routes (§ 164.312(b) gap).
MEDIUM — info disclosure without direct PHI nexus; DoS vector; missing defense-in-depth on a PHI-adjacent path.
LOW — defense-in-depth gap with no PHI nexus.
INFO — observation / posture note.

"PHI-adjacent" means the path touches metadata, signals, or fields that would appear in a HIPAA compliance report, audit trail, or Breach risk-assessment, even when the underlying PHI is not itself exposed. Example: /api/users/me/mfa-enrolled poisons the mfa_enrolled_at field that feeds § 164.308(a)(5) workforce-MFA attestations — PHI-adjacent, MEDIUM. Counter-example: missing HSTS on a static-asset CDN with no session tokens — no PHI nexus, LOW.

Baseline examples (the raw CVSS column is what a pure-technical scorer would emit; effective is what ships):

| Finding | CVSS tier | PHI tier | Effective | Note |
| --- | --- | --- | --- | --- |
| Unauth `GET /api/patients/<id>` returns 200 | High | Critical | Critical | § 164.402 Breach — reportable |
| Authed clinician B reads tenant A's patient | High | Critical | Critical | Cross-tenant BOLA = Breach |
| Stored XSS in session notes | Medium | Critical | Critical | PHI integrity + exfiltration path |
| `audit_logs` silent gap > 24h on PHI route | Medium | High | High | § 164.312(b) enforcement failure |
| Stale npm dep, CVSS 7.5, no PHI reachability | High | None | High | No overlay needed |
| HSTS header missing | Low | None | Low | Defense-in-depth only |
| Backend routes inference to `api.anthropic.com` (not Vertex); transcripts contain PHI | Low (config) | Critical | Critical | PHI disclosed to non-BAA subprocessor = § 164.504(e) impermissible use + § 164.402 Breach candidate — `Breach candidate: § 164.402` |
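The overlay formula is an ordered maximum over severity tiers. The sketch below assumes the tier ordering NONE < INFO < LOW < MEDIUM < HIGH < CRITICAL and treats a finding as a Breach candidate whenever its PHI tier is CRITICAL, which is one reading of the flagging rule in this section; the function name is illustrative.

```python
TIERS = ["NONE", "INFO", "LOW", "MEDIUM", "HIGH", "CRITICAL"]
RANK = {tier: position for position, tier in enumerate(TIERS)}


def effective_severity(cvss_tier, phi_impact_tier):
    """effective_severity = max(cvss_tier, phi_impact_tier), plus the Breach flag."""
    cvss = cvss_tier.upper()
    phi = phi_impact_tier.upper()
    effective = max(cvss, phi, key=RANK.__getitem__)
    # Assumed reading: the PHI nexus alone is enough to make the finding CRITICAL.
    breach_candidate = phi == "CRITICAL"
    return effective, breach_candidate


# Rows taken from the baseline examples: (CVSS tier, PHI tier).
for row in (("High", "Critical"), ("Medium", "High"), ("High", "None"), ("Low", "None")):
    print(row, "->", effective_severity(*row))
```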

Compute and emit a CVSS 3.1 base vector for every finding (e.g., AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:H/A:N) alongside the effective HIPAA-overlay severity, so auditors can see both scores and the reasoning.

Breach-candidate flag. Any finding scored CRITICAL because of a PHI nexus (not because CVSS alone was Critical) must carry a Breach candidate: § 164.402 tag in its subsection header, with a sentence naming the specific disclosure path. This is what separates a HIPAA-aware report from a generic CVE dump, and it's the line an OCR reviewer looks for first.

Methodology reference: OWASP Risk Rating Methodology (business-impact multiplier), HITRUST CSF Risk Analysis Guide, HHS OCR "Guidance on Risk Analysis Requirements" under § 164.308(a)(1)(ii)(A). The max(cvss_tier, phi_impact_tier) formula is the common Coalfire/Clearwater pattern for HIPAA-aware pentest reports.

6a. Positive controls and items tested clean

Single evidence-rigor table — no parallel bullet list elsewhere. Columns: Control | What was tested | Expected behavior | Observed evidence | §164 control mapped. Draw from the checklist (§1–§13) — at minimum cover: TLS ≥1.2 enforcement, security headers (HSTS / CSP / X-Content-Type-Options), CORS origin allowlist, unauth 401 on PHI routes, JWT alg=none rejection, X-Tenant-ID header ignored, cross-tenant 404-not-403, admin-route ADMIN_REQUIRED enforcement, Cloud Run egress env vars pointing to Vertex (not public Gemini/Anthropic), audit_logs table present with recent rows, audit schema PHI-free, CI supply-chain scans passing on the deployed tag. Every row cites specific evidence (status code, header value, row count, screenshot-equivalent text output).

Also include controls the model considered and dismissed as non-issues — e.g. "/api/ext/auth/seed-admin returns 404 as expected, not a live backdoor despite setup-solo.sh reference." This is what shows the auditor the technical safeguards are working and the assessor looked at the full surface, not merely what was written down. Dismissed scanner false positives belong in §8 / §9 with the tool context, not here.

7. HIPAA Security Rule control matrix

Copy the table under HIPAA control matrix below into this section, filling Status / Evidence / Gap from this run's observations. Every row required; N/A rows need justification. Every Partial or Fail must link to a finding in §6.

8. Static analysis findings (semgrep)

Only confirmed-real hits, grouped by rule ID with file:line. Dismissed false positives stay here as a Dismissed: subsection with the reason — not duplicated in §6a or Appendix B.

9. Dependency & supply-chain scan

CI already runs pip-audit, trivy, and npm audit on every PR + weekly (see .github/workflows/ci.yml + security.yml). The pentest adds what CI can't: (a) scanning what's currently deployed (may lag main), (b) self-contained evidence inline in the report (auditors shouldn't need to cross-reference the GitHub Security tab). Flag any divergence between "CI clean" and "deployed tag vulnerable" as its own finding.

Subsections:

Python deps — pip-audit: Package | Installed | Vulnerable | Fixed | CVE | CVSS
Deployed container image — trivy image <deployed tag> where the tag comes from gcloud run services describe pablo-backend --region=us-central1 --format='value(spec.template.spec.containers[0].image)'
Multi-ecosystem — osv-scanner --recursive /app/backend
Secrets in git — gitleaks detect --source /app/backend --no-git (any hit = CRITICAL finding in §6)
Documented exceptions — table of every entry under ## Open in docs/pentest/VULNERABILITY_EXCEPTIONS.md. Columns: Advisory | Package | Severity | Status | Raised | Age (days) | Revisit by | Stale (>30d)?. Note the file path actually used (or "file not found" with the paths searched). For each row whose advisory ID also appears in pip-audit / trivy / osv-scanner output above, add a Cross-ref: line under the row pointing at that scanner's finding ID — and downgrade the §6 entry for that advisory to INFO with Status: documented-exception. Stale rows (Age > 30) get a separate HIGH §6 finding per the staleness rule.

10. Prior-run carry-over & trends

Fetch the newest .md before today from gs://<COMPLIANCE_REPORT_BUCKET>/pentest/. Diff into: Resolved since last run / Persisting (with "consecutive runs open" counter) / Regressions (elevate to HIGH minimum) / New this run. If no prior report exists, mark this as the baseline.
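The four carry-over buckets are set operations over finding IDs. This is a sketch under stated assumptions: the prior report has been parsed into a mapping of finding ID to its "consecutive runs open" counter, and a separate set of IDs known to have been resolved in earlier runs is available to detect regressions. Both inputs, the function name, and the sample IDs are hypothetical.

```python
def diff_runs(prior_open, current_ids, previously_resolved=frozenset()):
    """Bucket findings into resolved / persisting / regressions / new."""
    prior_ids = set(prior_open)
    current = set(current_ids)
    appeared = current - prior_ids
    regressions = appeared & set(previously_resolved)
    return {
        "resolved": sorted(prior_ids - current),
        # Increment the consecutive-runs-open counter for anything still open.
        "persisting": {fid: prior_open[fid] + 1 for fid in sorted(prior_ids & current)},
        "regressions": sorted(regressions),  # elevate to HIGH minimum
        "new": sorted(appeared - regressions),
    }


# Hypothetical finding IDs for illustration only.
trend = diff_runs(
    prior_open={"PABLO-001": 2, "PABLO-002": 1},
    current_ids=["PABLO-001", "PABLO-003", "PABLO-004"],
    previously_resolved={"PABLO-004"},
)
print(trend)
```

With no prior report, calling `diff_runs({}, current_ids)` puts everything under "new", which corresponds to the baseline case.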

11. Endpoint coverage matrix

Group by auth profile, not by individual route. Columns: Auth profile | Tenant-scoped | Route count | Representative routes | Tests run | Result. One row per (auth_requirement, tenant_scope) combination — e.g. one row covering all require_mfa + tenant-scoped PHI routes, one for require_mfa + admin-only, one for get_current_user_no_mfa, one for public. List 2–3 representative routes per row; the exhaustive enumeration goes in Appendix B under endpoints.txt.

Any route that doesn't fit a known profile (missing Depends(), explicit public=true tag, unusual composition) gets its own row. Every unusual row needs either a positive test result or an explicit "deferred to external pentester" reason — never "ran out of time."

Enumerate the full set for the appendix:

grep -rnE "@router\.(get|post|put|delete|patch)\(" /app/backend/app/routes/ > /tmp/endpoints.txt
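
Once each enumerated route has been labelled with its auth requirement and tenant scope, collapsing the list into matrix rows is a grouping step. A minimal sketch, assuming the labelling has already been done by hand or by a separate parser; the dict keys and `coverage_rows` are hypothetical.

```python
from collections import defaultdict


def coverage_rows(routes: list[dict], max_examples: int = 3) -> list[dict]:
    """Collapse routes into one row per (auth requirement, tenant scope) pair."""
    groups: dict[tuple[str, bool], list[str]] = defaultdict(list)
    for route in routes:
        groups[(route["auth"], route["tenant_scoped"])].append(route["path"])

    rows = []
    for (auth, tenant_scoped), paths in sorted(groups.items()):
        rows.append({
            "auth_profile": auth,
            "tenant_scoped": tenant_scoped,
            "route_count": len(paths),
            # Only a few examples go in the matrix; the full list stays in Appendix B.
            "representative_routes": sorted(paths)[:max_examples],
        })
    return rows
```

Routes that fit no known profile can be given a distinct `auth` label (for example "unknown") so that each one surfaces as its own row.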

12. Automated assessment scope boundaries

Describe what lies outside the boundary of this automated engagement and why. The audience is auditors deciding what human-led testing should cover next, so frame each entry as a scope constraint ("requires human judgment / multi-session context / physical access / social engineering") rather than a gap ("we didn't test this"). Entries belong here only when no automated tool can assert them: business-logic depth, multi-step stateful workflows, novel zero-days, prompt-injection depth, physical controls, social engineering. Do not list items that were simply skipped due to time; those are findings or fallback notes in the relevant section.

Static analysis exclusions (semgrep): The following paths are excluded from semgrep via --exclude flags (rationale mirrored in /.semgrepignore). A human reviewer or independent pentester should audit these paths directly:

| Excluded path | Rule(s) suppressed | Rationale |
| --- | --- | --- |
| `alembic/` | `avoid-sqlalchemy-text` | Migration DDL; `text()` calls are operator-authored SQL, never reachable from web requests |
| `tests_integration/`, `tests/` | `avoid-sqlalchemy-text` | Test fixture DDL; not on the production attack surface |
| `app/db/__init__.py`, `app/db/provisioning.py` | `avoid-sqlalchemy-text` | Tenant schema management DDL (`CREATE SCHEMA`, `SET search_path`, `pg_advisory_lock`, RLS policy setup); all `text()` arguments are system-generated, never from user input |
| `app/jobs/pentest_*.py` | `dynamic-urllib-use-detected` | Self-assessment tooling; dynamic outbound calls target IAM-gated GCP admin APIs and operator-configured webhook URLs |
| `app/jobs/hipaa_log_review.py` | `dynamic-urllib-use-detected` | Audit log reviewer; urllib targets an operator-configured webhook (admin-only runtime config) |

Inline # nosemgrep annotations suppress individual false-positive hits where exclusion would be too broad (e.g. auth/service.py unverified-JWT routing helper, logger calls in auth handlers).

13. Prioritized remediation roadmap

Ordered list grouped by severity. Columns: Finding ID | Severity | Effort (S/M/L) | Target date | Owner | Retest by. The Retest by column defaults to "next scheduled run" — only override for CRITICAL items that need sooner re-verification (e.g. "next run + manual curl confirmation within 7 days"). The §15 retest-plan section was merged here; if a finding needs a bespoke retest procedure beyond "run this skill again," describe it inline in that finding's §6 subsection under a Retest procedure: line.

14. Appendices

The pentest_runner.py wrapper uploads every raw scanner artifact to gs://<COMPLIANCE_REPORT_BUCKET>/pentest/<run-uuid>/raw/ with retention lock. The report inlines findings-level output (the specific lines that drove a conclusion) and links to the GCS object for full raw dumps. Inline blocks are capped at ~50 lines each — anything longer, link out.

A: Commands executed — chronological list of every shell command/API call the run made (redact tokens and any PHI). One command per line. Audit defense: an OCR reviewer must be able to reconstruct what actually ran. This one stays fully inline — it's short and load-bearing.
B: Scanner invocations & findings — one subsection per tool (nuclei, ffuf, semgrep, pip-audit, trivy, osv-scanner, gitleaks, testssl.sh, nikto/sqlmap if invoked, plus the endpoints.txt enumeration from §11). For each: exact invocation, exit code, summary line counts (total / high / medium / low / info), inline excerpts of the specific lines that became §6 findings, and a Raw output: gs://.../raw/<tool>.txt link. Dismissed false positives live here with their dismissal reason — do not duplicate them in §6a.
C: Cloud SQL query log — every SELECT run against the DB and the row counts returned (values redacted). Stays fully inline.
D: SBOM — link to gs://.../raw/sbom.cyclonedx.json; inline a one-line summary (component count, critical CVE count). If the image scan is unavailable, note that and link to pip-list.txt instead.
E: Attestation block — tester identity (CLI + model), run UUID, ISO timestamp, SHA256 of the report body (excluding this block). Unsigned (automated run); needs human countersign before the operator uses it as input to the annual §164.314(a) written verification to Covered Entities.
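
The SHA256 in the attestation block has to cover the report body but not the block itself, otherwise writing the digest would change the digest. A minimal sketch, assuming the attestation block starts at a fixed marker string; the marker text and `report_body_digest` are hypothetical.

```python
import hashlib

ATTESTATION_MARKER = "E: Attestation block"  # assumed heading text, adjust to the real report


def report_body_digest(report: str, marker: str = ATTESTATION_MARKER) -> str:
    """SHA256 hex digest of everything before the attestation block."""
    # partition() returns the whole string as the body when the marker is absent.
    body, _, _ = report.partition(marker)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()
```

Because only the text before the marker is hashed, filling in the run UUID, timestamp and digest afterwards leaves the value stable.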

Output handling: emit the complete markdown report — through all appendices — to stdout. Do NOT gsutil cp or otherwise upload to GCS yourself; the calling runner (pentest_runner.py) captures stdout and uploads to the retention-locked compliance bucket. Uploading from inside the skill creates duplicate objects with inconsistent metadata. The final thing you emit must be the closing of appendix E; do not append a trailing "uploaded to gs://…" line.

Length & tone: body 2000–3000 words; appendices on top of that. Inline evidence for anything ≤50 lines; link to the GCS raw-artifact bucket for longer dumps (see §14). Audit-ready neutral tone. Every "pass" claim cites observable evidence.

HIPAA control matrix (copy into §7 of every report)

Administrative safeguards (§164.308)

| Control | Requirement | Status | Evidence | Gap |
| --- | --- | --- | --- | --- |
| §164.308(a)(1)(ii)(A) | Risk analysis: accurate, thorough, documented; reviewed ≥12mo | | | |
| §164.308(a)(1)(ii)(B) | Risk management: reduce risks to reasonable level | | | |
| §164.308(a)(1)(ii)(D) | Information system activity review: logs, access reports, incident reports | | | |
| §164.308(a)(3)(ii)(A) | Workforce authorization / supervision | | | |
| §164.308(a)(3)(ii)(C) | Termination procedures: revoke access on departure | | | |
| §164.308(a)(4)(ii)(B) | Access authorization: granted per role | | | |
| §164.308(a)(4)(ii)(C) | Access establishment & modification | | | |
| §164.308(a)(5)(ii)(C) | Log-in monitoring: detect anomalies | | | |
| §164.308(a)(5)(ii)(D) | Password management (NPRM: MFA required) | | | |
| §164.308(a)(6)(ii) | Security incident response & reporting | | | |
| §164.308(a)(7)(ii)(A) | Data backup plan: tested | | | |
| §164.308(a)(7)(ii)(B) | Disaster recovery plan: restore ≤72h (NPRM) | | | |
| §164.308(a)(7)(ii)(D) | Contingency plan testing ≥12mo | | | |
| §164.308(a)(8) | Technical evaluation (this report): annual pentest + biannual vuln scan (NPRM) | | | |

Physical safeguards (§164.310) — out of scope for this automated scan (cloud-inherited or workforce-level). Operator tracks separately.

Technical safeguards (§164.312) — the heart of this pentest.

| Control | Requirement | Status | Evidence | Gap |
| --- | --- | --- | --- | --- |
| §164.312(a)(2)(i) | Unique user identification | | | |
| §164.312(a)(2)(ii) | Emergency access procedure | | | |
| §164.312(a)(2)(iii) | Automatic logoff / session timeout | | | |
| §164.312(a)(2)(iv) | Encryption / decryption of ePHI at rest (NPRM: required) | | | |
| §164.312(b) | Audit controls: record & examine activity | | | |
| §164.312(c)(1) | Integrity: protect ePHI from improper alteration/destruction | | | |
| §164.312(c)(2) | Mechanism to authenticate ePHI (NPRM) | | | |
| §164.312(d) | Person or entity authentication (NPRM: MFA required) | | | |
| §164.312(e)(1) | Transmission security | | | |
| §164.312(e)(2)(i) | Integrity controls in transit | | | |
| §164.312(e)(2)(ii) | Encryption in transit (NPRM: required) | | | |

Organizational / BA contracts (§164.314) — paper controls, out of scope for this automated scan. Operator tracks separately.

Administrative (§164.308) non-technical rows — rows like workforce authorization, termination procedures, risk analysis documentation are paper controls. The scanner only fills rows where it has direct technical evidence (e.g. §164.308(a)(5)(ii)(D) MFA via JWT claim inspection, §164.308(a)(1)(ii)(D) via audit_logs freshness). Mark the rest N/S (out of automated scope) — do not mark them Pass/Fail from assumption.

Status values: Pass / Partial / Fail / N/S (out of automated scope). Every Partial / Fail must link to a finding in §6. N/S rows do not require evidence — they're tracked outside this report.
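
The status rules lend themselves to a mechanical check before the report is emitted. A minimal sketch, assuming the matrix rows are available as dicts; the keys (`control`, `status`, `evidence`, `finding`) and `matrix_problems` are hypothetical. The Pass check reflects the rule that every "pass" claim cites observable evidence.

```python
ALLOWED_STATUSES = {"Pass", "Partial", "Fail", "N/S"}


def matrix_problems(rows: list[dict], finding_ids: set[str]) -> list[str]:
    """Return one message per control matrix row that breaks the status rules."""
    problems = []
    for row in rows:
        control, status = row["control"], row["status"]
        if status not in ALLOWED_STATUSES:
            problems.append(f"{control}: unknown status {status!r}")
        elif status in {"Partial", "Fail"} and row.get("finding") not in finding_ids:
            problems.append(f"{control}: {status} without a linked §6 finding")
        elif status == "Pass" and not row.get("evidence"):
            problems.append(f"{control}: Pass without cited evidence")
        # N/S rows need neither evidence nor a finding.
    return problems
```

An empty return value means every Partial / Fail row points at a known finding ID and every Pass row cites evidence.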

Known recurring findings (verify each run — patches may land between runs)

Migration target: any row whose re-test is a one-line grep belongs in make lint (as a custom semgrep rule or a tiny pytest) rather than in a weekly pentest. When a regression check appears in the pentest two runs in a row and has a mechanical verifier, file a task to promote it to CI and remove the row here. The pentest should focus on what a static check can't catch (live config, cross-tenant behavior, deployed-image vs source drift).

| Findings seen in past runs | Re-test |
| --- | --- |
| `ext_auth.py` `verify_token` missing `audience=` | grep `ext_auth.py` for `verify_token(`; confirm `audience=` kwarg present |
| `AuditService()` instantiated without DB → `logger.info` only | grep `get_audit_service` and `_persist`; confirm a Postgres write path |
| `sessions.py:upload_audio` reads before size check | grep `upload_audio` for `await .*\.read()`; confirm chunked reads |
| `/api/users/me/mfa-enrolled` trusts client | grep `mfa-enrolled` for Firebase Admin MFA verification |
| `restrict_signups=false` on this deployment | `gcloud run services describe pablo-backend --region=us-central1` env vars + live `signUp` probe |
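
As an illustration of promoting a grep-style re-test to CI, the `audience=` check could become a small AST-based helper that a pytest asserts on. This is a sketch of one possible verifier, not the project's actual lint rule; `calls_missing_kwarg` is a hypothetical name.

```python
import ast


def calls_missing_kwarg(source: str, func_name: str, kwarg: str) -> list[int]:
    """Line numbers of calls to func_name that do not pass kwarg explicitly."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        callee = node.func
        # Handle both bare calls, f(...), and attribute calls, obj.f(...).
        if isinstance(callee, ast.Attribute):
            name = callee.attr
        else:
            name = getattr(callee, "id", None)
        # A **kwargs splat has arg=None, so it is conservatively counted as missing.
        if name == func_name and not any(kw.arg == kwarg for kw in node.keywords):
            missing.append(node.lineno)
    return sorted(missing)
```

A test could read `ext_auth.py` and assert that `calls_missing_kwarg(text, "verify_token", "audience")` returns an empty list. Unlike a grep, this also handles calls whose arguments span several lines.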

Cleanup checklist (do before writing the report)

Delete any test users: curl -X POST "identitytoolkit /accounts:delete" -d '{"idToken":"<tok>"}' (sign in to get a fresh idToken first).
pkill -f "cloud-sql-proxy --port 15433" (or whatever port you used).
Confirm no UPDATE/DELETE/INSERT was issued (review your psql commands).
Redact any PHI names/emails/DOBs from evidence strings in the report.
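
Part of the redaction step can be pattern-based. A minimal sketch covering emails and date-shaped strings only; the patterns and `redact` are illustrative assumptions. Names cannot be matched reliably by regex and still need a manual pass.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Matches ISO (YYYY-MM-DD) and US-style (M/D/YYYY) dates. This is deliberately
# broad, so it also masks non-DOB dates that appear in evidence strings.
DATE_RE = re.compile(r"\b(?:\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{4})\b")


def redact(text: str) -> str:
    """Mask email addresses and date-shaped values in an evidence string."""
    text = EMAIL_RE.sub("[REDACTED-EMAIL]", text)
    return DATE_RE.sub("[REDACTED-DATE]", text)
```

Running evidence strings through a helper like this before inlining them reduces the chance of a stray address or DOB reaching the report, but it does not replace reading the final text.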