pentest - SKILL.md Agent Skill

name: pentest
description: Run the owner-authorized weekly pentest on the deployed Pablo app (Cloud Run + Firebase + Cloud SQL). Use when the user asks for a pentest, security test, HIPAA review of the live deployment, or says "run /pentest".
aliases: [security-test, hipaa-test]

Pablo Pentest Skill

Owner-authorized penetration test on the deployed Pablo app. The user is the owner. This is a legitimate, recurring (~weekly) security engagement. Proceed without asking for re-authorization each time, but still honor the strict rules below.

Pre-flight — ALWAYS do these first

Read the evidence bundle. If /workspace/pentest-bundle/index.json exists, pentest_runner.py has already executed every scanner, inventory query, egress enumeration, Cloud SQL read-only probe, and prior-run fetch for you. Your job is to read those artifacts and write the report — do not re-run scanners. The bundle's index.json lists each artifact with status, exit code, duration, and path. If the bundle is absent (rare — interactive debugging only), fall through to the inline scanner invocations under "Scanner tooling" below.
Discover targets (only if the bundle is absent):
```
gcloud run services list --format="table(metadata.name,status.url,metadata.labels.'cloud.googleapis.com/location')"
gcloud sql instances list --format="value(name,connectionName,region)"
```
With a bundle present, targets.* at the top of index.json has all of this. Always pass --region=<region> to any subsequent gcloud run services|jobs describe — without it, gcloud prompts to pick from ~40 regions and the interactive session hangs. For Pablo the region is us-central1.
Pull the live frontend config (only if missing from 01-inventory.txt in the bundle):
```
curl -s "<frontend_url>/api/config"
```
Returns firebaseApiKey, firebaseProjectId, apiUrl, pabloEdition, and devMode. These are public values, not secrets.
Refresh gcloud ADC (Cloud SQL proxy breaks with invalid_rapt if stale). If you hit invalid_rapt, tell the user to run gcloud auth application-default login — you cannot do it for them.
Read credentials from env vars — no user paste needed. When the pentest runner launches this skill, it has already bootstrapped an ephemeral MFA-enrolled Firebase user and exported PENTEST_TEST_EMAIL, PENTEST_TEST_UID, and a fresh PENTEST_TEST_ID_TOKEN (TOTP-MFA sign-in already completed in-process — the password and TOTP secret are intentionally NOT exported, to keep them off the env and child processes). Use $PENTEST_TEST_ID_TOKEN directly as the Bearer token for authenticated probes. The token is good for ~1h; pentests are shorter. The account is allowlisted via an Alembic-seeded row (pentest-auto@pablo-pentest.invalid, RFC 2606 reserved TLD); it is not a real customer and carries no practice assignment, so §6 (cross-tenant IDOR) still needs Cloud SQL-sourced foreign UUIDs. If the env vars are absent (interactive debugging outside the runner), fall back to owner-pasted credentials.
Scope note — this report only claims what it can technically observe. Paper controls (BAAs, signed policies, workforce training records, contingency plan documents) are out of scope for automated scanning. The companion document docs/compliance/hipaa-security-rule.md (per-control narrative) is the source of truth for those; this report's §7 control matrix marks them N/S (paper control — see narrative) and does not assert their status. Egress destinations and deployed service config are technically observable and stay in scope.
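A minimal sketch of the first pre-flight step, reading index.json and surfacing artifacts that did not finish cleanly so the report can note the gaps. The field names (`artifacts`, `status`, `exit_code`, `path`, `duration_s`) and the `"ok"` status value are assumptions for illustration; check them against the real bundle before relying on this.

```python
import json


def failed_artifacts(index: dict) -> list[str]:
    """Paths of artifacts that did not finish cleanly (assumed schema, see lead-in)."""
    failed = []
    for artifact in index.get("artifacts", []):
        # Treat a missing exit code as a failure so gaps are surfaced, not hidden.
        if artifact.get("status") != "ok" or artifact.get("exit_code") != 0:
            failed.append(artifact.get("path", "<unknown>"))
    return failed


# Hypothetical index content; the real file lives at /workspace/pentest-bundle/index.json.
SAMPLE_INDEX = json.loads("""
{"artifacts": [
  {"path": "01-inventory.txt", "status": "ok", "exit_code": 0, "duration_s": 4.2},
  {"path": "30-cloud-sql.txt", "status": "error", "exit_code": 1, "duration_s": 0.8},
  {"path": "40-closed-loop-audit.txt", "status": "ok", "exit_code": 0, "duration_s": 2.1}
]}
""")

print(failed_artifacts(SAMPLE_INDEX))
```

Any path returned here is worth calling out in the report as evidence that could not be collected, rather than silently treating the control as clean.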

Strict rules (don't violate)

Rate limits are scoped by target class:
- Own Cloud Run services (*.run.app frontend/backend): ≤20 req/sec steady, short bursts up to 50 req/sec OK for scanner sweeps. Total run budget ≤30,000 requests across all tools.
- Firebase Auth (identitytoolkit.googleapis.com): strict — ≤1 req/sec, ≤10 bad-password attempts total across the whole run. Firebase anti-abuse locks accounts fast.
- Third-party infra (Google infra outside your Cloud Run, Firebase internals, Anthropic, dependencies): zero — do not probe.
No DoS. A sustained burst that trips Cloud Armor or drives Cloud Run autoscaling into new instances is itself a finding — stop and report, do not keep pushing. Never use scanner flags like -t 100 / --threads 10 / -rate 200.
Read-only against Cloud SQL directly. No UPDATE / DELETE / INSERT statements through the psql / cloud-sql-proxy path. The DB-side assertion is: if the pentest touched the DB, it did so only via SELECT.
Writes through the authenticated API are allowed — with cleanup. You may create test patients and therapy sessions via the normal authenticated POST endpoints in order to exercise CRUD paths. Constraints:
- Prefix the first name with PENTEST- and the last name with a run UUID (e.g. PENTEST-ab12cd34) so cleanup is deterministic.
- Never upload real transcript content or real PHI. Synthetic placeholder strings only (e.g. "synthetic test transcript for pentest run ab12cd34").
- At the end of the run, delete every patient/session/appointment you created via the authenticated DELETE endpoints. Final step of every run must be a cleanup pass.
- If cleanup fails partway, report the un-deleted IDs in the findings so the owner can sweep manually.
Test users: if you create test users (identitytoolkit:signUp), delete them at the end via identitytoolkit /accounts:delete. Prefer pre-provisioned test accounts passed in by the owner — only mint new ones when the test explicitly needs a fresh identity.
Exploit to PoC only — one clear reproduction, then stop. No further pivoting.
Stay on *.run.app services belonging to this app and its Cloud SQL instance. Do not probe Google / Firebase infra itself, Anthropic, or other third parties.
No stored XSS payloads. Reflected-only, in your own session.
Redact any PHI/PII in the report to <REDACTED>.
Stop on lockout / WAF / rate-limit and report what you have.
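For hand-written probes that do not go through a rate-limited scanner, the Cloud Run limits above (≤20 req/sec steady, ≤30,000 requests per run) can be enforced with a small pacing helper. This is a sketch only: the class name `RequestBudget` and its interface are hypothetical, and the clock is passed in so the logic stays deterministic.

```python
class RequestBudget:
    """Paces requests to a steady rate and enforces a total per-run budget."""

    def __init__(self, max_rps: float = 20.0, total_budget: int = 30_000):
        self.min_interval = 1.0 / max_rps
        self.remaining = total_budget
        self.next_allowed = 0.0

    def acquire(self, now: float) -> float:
        """Return how many seconds to sleep before sending the next request."""
        if self.remaining <= 0:
            # Budget spent: the rules say stop and report, not keep pushing.
            raise RuntimeError("run request budget exhausted")
        self.remaining -= 1
        wait = max(0.0, self.next_allowed - now)
        # Schedule the following request one interval after this one is sent.
        self.next_allowed = max(now, self.next_allowed) + self.min_interval
        return wait
```

A caller would do `time.sleep(budget.acquire(time.monotonic()))` before each request. Firebase Auth needs its own instance with `max_rps=1.0`, plus a separate counter for the 10 bad-password attempts.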

Scanner tooling (pre-installed — use sanctioned invocations)

The pentest container ships with rate-limited scanners. The flags below are mandatory — don't drop them.

nuclei — template-driven vuln/misconfig scan against the backend:

nuclei -u "<backend_url>" \
  -t cves/ -t exposures/ -t misconfiguration/ -t http/default-logins/ \
  -rate-limit 20 -c 10 \
  -severity medium,high,critical \
  -exclude-tags dos,fuzz,intrusive \
  -o /tmp/nuclei.txt

ffuf — endpoint discovery:

ffuf -u "<backend_url>/api/FUZZ" \
  -w /usr/share/seclists/Discovery/Web-Content/common.txt \
  -rate 20 -t 5 -mc 200,201,301,302,401,403 \
  -of json -o /tmp/ffuf.json

sqlmap — only on a specific parameter you already suspect. Never blanket-run:

sqlmap -u "<backend_url>/api/endpoint?param=1" \
  --headers="Authorization: Bearer $ID_TOKEN" \
  --delay 0.5 --threads 1 --level 1 --risk 1 \
  --technique=BT --batch --flush-session --output-dir=/tmp/sqlmap

nikto — narrow tunings:

nikto -h "<backend_url>" -Tuning 2,3,4,5 -maxtime 5m -o /tmp/nikto.txt

testssl.sh — TLS posture:

testssl.sh --severity MEDIUM --color 0 "<backend_url>" > /tmp/testssl.txt

semgrep — static analysis over the baked-in backend source (/app/backend). Feed results into the Static analysis findings section of the report; review each hit manually before elevating.

semgrep scan \
  --config=p/owasp-top-ten --config=p/python --config=p/security-audit \
  --severity=ERROR --severity=WARNING \
  --json --output=/tmp/semgrep.json --metrics=off --quiet \
  /app/backend
jq '.results | group_by(.check_id) | map({rule: .[0].check_id, count: length, paths: [.[].path] | unique})' /tmp/semgrep.json

Playwright + Chromium — JS-console capture, DOM XSS PoC, CSP verification. The global node_modules is symlinked at /node_modules, so any .mjs you drop in /tmp or /workspace can import 'playwright' with no extra setup (ESM ignores NODE_PATH; it walks up from the script looking for node_modules/ and hits /node_modules).

// /tmp/dom_probe.mjs — run: node /tmp/dom_probe.mjs
import { chromium } from 'playwright';
const browser = await chromium.launch({ headless: true });
const page = await browser.newPage();
page.on('console', msg => console.log(`CONSOLE ${msg.type()}: ${msg.text()}`));
page.on('pageerror', err => console.log(`PAGEERROR: ${err.message}`));
await page.goto(process.env.TARGET_URL);
await page.waitForLoadState('networkidle');
const resp = await page.request.get(process.env.TARGET_URL);
console.log('CSP:', resp.headers()['content-security-policy']);
console.log('HSTS:', resp.headers()['strict-transport-security']);
await browser.close();

Auth flow — Firebase MFA sign-in (bash)

Firebase MFA sign-in is three calls. Use this pattern; the TOTP window matters (Firebase rejects replays).

API_KEY="<from /api/config firebaseApiKey>"
SECRET="<TOTP b32 secret for this account>"

# Wait until the start of a fresh TOTP window to avoid a wasted attempt
until [ $((30 - $(date +%s) % 30)) -lt 3 ]; do sleep 2; done
sleep 3
CODE=$(oathtool --totp -b "$SECRET")

resp=$(curl -s -X POST "https://identitytoolkit.googleapis.com/v1/accounts:signInWithPassword?key=$API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"email":"<email>","password":"<pw>","returnSecureToken":true}')
MFA_CRED=$(echo "$resp" | python3 -c "import json,sys; print(json.load(sys.stdin)['mfaPendingCredential'])")
ENROLLMENT_ID=$(echo "$resp" | python3 -c "import json,sys; print(json.load(sys.stdin)['mfaInfo'][0]['mfaEnrollmentId'])")

# Finalize — MUST include both mfaPendingCredential AND mfaEnrollmentId
curl -s -X POST "https://identitytoolkit.googleapis.com/v2/accounts/mfaSignIn:finalize?key=$API_KEY" \
  -H "Content-Type: application/json" \
  -d "{\"mfaPendingCredential\":\"$MFA_CRED\",\"mfaEnrollmentId\":\"$ENROLLMENT_ID\",\"totpVerificationInfo\":{\"verificationCode\":\"$CODE\"}}" \
  > /tmp/signin.json

ID_TOKEN=$(python3 -c "import json; print(json.load(open('/tmp/signin.json'))['idToken'])")

Gotchas learned the hard way:

mfaSignIn:finalize requires BOTH mfaPendingCredential and mfaEnrollmentId. Omitting mfaEnrollmentId returns a confusing INVALID_ARGUMENT.
A consumed TOTP code → INVALID_CODE. Wait for the next 30-second window before retrying.
The Bash tool blocks long leading sleeps (a direct sleep 32 is rejected); use the until [ ... ]; do sleep 2; done loop pattern to wait for the next window.
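The window arithmetic behind the until-loop, and what oathtool computes, can be sketched in Python with the standard library only. This follows RFC 6238 with HMAC-SHA1 and a 30-second period; the function names are illustrative, and the key is passed as raw bytes (decode a base32 secret with `base64.b32decode` first).

```python
import hashlib
import hmac
import struct


def seconds_left_in_window(now_epoch: int, period: int = 30) -> int:
    """Seconds until the current TOTP window rolls over (1..period)."""
    return period - (now_epoch % period)


def totp(key: bytes, now_epoch: int, digits: int = 6, period: int = 30) -> str:
    """RFC 6238 TOTP using HMAC-SHA1."""
    counter = struct.pack(">Q", now_epoch // period)
    digest = hmac.new(key, counter, hashlib.sha1).digest()
    # Dynamic truncation: the low nibble of the last byte selects a 4-byte slice.
    offset = digest[-1] & 0x0F
    value = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


# RFC 6238 Appendix B test vector (SHA-1, 8 digits, T = 59 seconds).
print(totp(b"12345678901234567890", 59, digits=8))  # → 94287082
```

Checking `seconds_left_in_window` before generating a code avoids submitting one that expires mid-request, which is the same goal as the bash loop above.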

Secrets lookup (GCP)

Secret Manager in the Pablo GCP project is authoritative. Names observed: pablo-database-url, pablo-db-password, AUTH_SECRET, JWT_SECRET_KEY, GOOGLE_CLIENT_ID/SECRET, AUTH_COOKIE_SIGNATURE_KEY. Project id is whatever is in /api/config.firebaseProjectId (e.g., pablohealth-oss).

gcloud secrets list --project=<project>
gcloud secrets versions access latest --secret=pablo-database-url --project=<project>
# parse: postgresql://pablo:<pw>@/pablo?host=/cloudsql/<conn>

Cloud SQL direct access

# Use a high port — the user may have another cloud-sql-proxy running on 5433
cloud-sql-proxy --port 15433 <connection> > /tmp/p.log 2>&1 &
# wait until "ready for new connections" appears in /tmp/p.log
DB_PW=$(gcloud secrets versions access latest --secret=pablo-db-password --project=<project>)
PGPASSWORD="$DB_PW" psql "host=127.0.0.1 port=15433 user=pablo dbname=pablo sslmode=disable" -c "\dt"

Always run read-only at the DB level. SELECT ... LIMIT 5. Queries to prioritize:

\dt — confirm audit_logs exists and has recent rows (SELECT count(*) FROM audit_logs WHERE timestamp > now() - interval '1 day';).
- Disambiguate row count == 0 with the closed-loop result before flagging. collect_closed_loop_audit (artifact 40-closed-loop-audit.txt / summary findings_count, highest_severity) actively writes one row in the same run and asserts it appears via /api/users/me/audit-log. Decision matrix:
  - closed-loop ok / NONE + row count 0 → INFORMATIONAL (idle period; pipeline proven alive). Goes in §6a as a positive control, not §6.
  - closed-loop ok / NONE + row count >0 → pass; nothing to flag.
  - closed-loop error/skipped + row count 0 → HIGH (pipeline unverified AND historical data empty). Real § 164.312(b) failure.
  - closed-loop error/skipped + row count >0 → MEDIUM (write-path unverified this run, but historical data shows activity).
- Never flag audit_logs empty as HIGH on closed-loop-green evidence — that's the false-positive class documented in 2026-05-03_pentest_report.md PABLO-002.
Confirm audit schema is PHI-free: \d+ audit_logs should NOT show user_email, user_name, patient_name columns. If it does, that's a regression.
RLS policies: SELECT schemaname, tablename, policyname FROM pg_policies;
SHOW row_security; and per-connection SHOW app.current_user_id;
Scan for unexpectedly plaintext PHI columns (e.g., \d+ patients → SSN, DOB storage).
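The audit_logs decision matrix above reduces to a two-input lookup. A minimal sketch, where `closed_loop_ok` stands for "closed-loop ok / NONE" and the function name is hypothetical:

```python
def audit_gap_severity(closed_loop_ok: bool, recent_rows: int) -> str:
    """Map the closed-loop result and the 24h row count to a report severity."""
    if closed_loop_ok:
        # Pipeline proven alive this run; an empty window is just an idle period.
        return "INFORMATIONAL" if recent_rows == 0 else "PASS"
    # Write path unverified this run; historical rows decide how bad it is.
    return "HIGH" if recent_rows == 0 else "MEDIUM"
```

INFORMATIONAL results belong in §6a as a positive control; HIGH and MEDIUM go to §6.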

Kill the proxy with pkill -f "cloud-sql-proxy --port 15433" when done.

Vulnerability exception log (read every run)

docs/pentest/VULNERABILITY_EXCEPTIONS.md is the operator's documented-risk register for advisories that aren't patched within the standard SLA — required by § 164.308(a)(1)(ii)(B). Treat it as first-class evidence: it both suppresses known-and-accepted dep-scan hits in §6 and is itself a thing that can rot.

Find the file. Path varies by how the runner mounted the repo — try in order, stop at the first hit:

EXC_FILE=""
for p in /workspace/pentest-bundle/repo/docs/pentest/VULNERABILITY_EXCEPTIONS.md \
         /workspace/repo/docs/pentest/VULNERABILITY_EXCEPTIONS.md \
         /app/docs/pentest/VULNERABILITY_EXCEPTIONS.md \
         docs/pentest/VULNERABILITY_EXCEPTIONS.md; do
  [ -f "$p" ] && EXC_FILE="$p" && break
done

If no path resolves, that's a MEDIUM finding in §6 — the operator cannot demonstrate documented risk decisions for unpatched advisories. Do NOT silently treat absence as "no exceptions."

Per open entry (under ## Open), parse ### <advisory-id>, Severity, Status, Revisit by, and the entry's age. Age sources, in order:

An explicit Raised: field, if present (preferred — push the operator to add one when missing).

Git first-add date for the entry heading (fallback):

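REPO_ROOT="$(git -C "$(dirname "$EXC_FILE")" rev-parse --show-toplevel)"
# -S matches commits that changed how often the heading occurs; log is newest-first, so tail -1 is the
# introducing commit. No --diff-filter=A: it keeps only file-creating commits and hides later entries.
git -C "$REPO_ROOT" log --format=%aI -S"### CVE-2026-3219" -- "$EXC_FILE" | tail -1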

The file's first-add date as the conservative upper bound if neither works.

Compute age_days = today - raised. Today is date -u +%Y-%m-%d.

Cross-reference into §6. For any §9 dep-scan hit (pip-audit, trivy, osv-scanner) whose advisory ID is listed under ## Open, do not promote it on dep-scan severity alone — record it as INFO with Status: documented-exception (see §9 Documented exceptions → <advisory-id>). That's the entire purpose of the log.
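The cross-reference step can be sketched as a small triage pass over dep-scan hits. The hit shape (`id`, `severity`) and the function name are assumptions for illustration; real pip-audit, trivy, and osv-scanner output needs normalizing into this shape first.

```python
def triage_dep_hits(hits: list[dict], open_exception_ids: set[str]) -> list[dict]:
    """Demote hits covered by an open documented exception to INFO."""
    triaged = []
    for hit in hits:
        entry = dict(hit)  # copy so the raw scanner output stays untouched
        if hit["id"] in open_exception_ids:
            entry["scanner_severity"] = hit["severity"]
            entry["severity"] = "INFO"
            entry["status"] = "documented-exception"
        triaged.append(entry)
    return triaged


# "EXAMPLE-ADVISORY-1" is a placeholder ID, not a real advisory.
hits = [
    {"id": "CVE-2026-3219", "severity": "HIGH"},
    {"id": "EXAMPLE-ADVISORY-1", "severity": "MEDIUM"},
]
result = triage_dep_hits(hits, {"CVE-2026-3219"})
```

Keeping the scanner's original severity alongside the demoted one lets a reader see what was suppressed and why.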

Staleness rule — HIGH finding when triggered. Any open entry with age_days > 30 ships in §6:

ID: PABLO-EXC-STALE-<advisory-id>
Title: Vulnerability exception stale (>30 days) — refresh or remediate: <advisory-id>
Severity: HIGH (§ 164.308(a)(1)(ii)(B) — documented risk decisions must be re-reviewed; >30 days without update means the decision is unverified against current upstream state)
Description: Exception was raised <age_days> days ago. Either upstream now ships a fix (close the entry), the situation has changed (refresh Why not patched, Compensating control, and Revisit by with current rationale), or the risk acceptance is silently aging. None of these is acceptable to OCR.
Remediation: Patch the dep if a fix exists, OR update the entry with current rationale and a new Revisit by. Bumping Revisit by alone without re-justifying does not clear this finding — the staleness check is on Raised:/git-add date, not on Revisit by. To reset the clock, add or update a Raised: field to today and document what was re-verified.
Owner: copy the entry's Owner (default Kurt Niemi).

If age_days ≤ 30 for every open entry → §6a positive control row: "Exception log fresh — N open, oldest <age_days> days, all within 30-day refresh window."
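The age computation and staleness rule can be sketched as follows. The input is a mapping from advisory ID to its resolved raised date (from Raised:, git, or the file's first-add date); the function name is hypothetical, and "EXAMPLE-ADVISORY-1" is a placeholder ID.

```python
from datetime import date


def stale_exception_findings(raised_by_id: dict[str, date], today: date,
                             max_age_days: int = 30) -> list[str]:
    """Return finding IDs for open entries older than the refresh window."""
    return sorted(
        f"PABLO-EXC-STALE-{advisory_id}"
        for advisory_id, raised in raised_by_id.items()
        # Strictly greater than: an entry that is exactly 30 days old is still fresh.
        if (today - raised).days > max_age_days
    )


open_entries = {
    "CVE-2026-3219": date(2026, 4, 20),       # 41 days before the sample "today"
    "EXAMPLE-ADVISORY-1": date(2026, 5, 1),   # exactly 30 days
}
findings = stale_exception_findings(open_entries, today=date(2026, 5, 31))
```

An empty result means the §6a positive-control row applies; otherwise each returned ID becomes its own HIGH finding in §6.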

Checklist (run in order — keep it updated as findings stabilize)

1. Recon — TLS (openssl s_client -tls1_1 should fail with alert 70), cert chain, security headers on frontend + backend, /api/config, /api/health, /docs and /openapi.json (404 in prod — good), /api/ext/auth/seed-admin (404 expected — referenced but not implemented), robots.txt. Run testssl.sh for full TLS config coverage.
1b. Scripted sweep — run nuclei and ffuf against the backend URL (sanctioned invocations above). Review findings manually; fold HIGH/CRITICAL into the report. Skip nikto / sqlmap unless the sweep surfaces a lead.
1c. Static analysis + dependency scan — semgrep (code), pip-audit (Python deps), trivy image against the deployed backend tag, osv-scanner --recursive /app/backend, gitleaks detect --source /app/backend --no-git. Feeds §8 and §9 of the report. Any gitleaks hit = automatic CRITICAL finding in §6.
1d. Vulnerability exception log review — locate docs/pentest/VULNERABILITY_EXCEPTIONS.md (see "Vulnerability exception log" section above), parse every entry under ## Open, compute age_days per entry, cross-reference advisory IDs against the §1c dep-scan output, and apply the staleness rule (>30 days → HIGH §6 finding). Missing file → MEDIUM finding.
2. Cloud Run IAM — gcloud run services get-iam-policy <svc>. Both backend and frontend are allUsers → roles/run.invoker by design; auth is at app layer. Verify with unauth curl → expect 401 on /api/{users/me, patients, sessions, admin/users} and 403 "Service auth required" on /api/ext/auth/*.
3. CORS — preflight with Origin: https://evil.com must not be reflected. Legit origin = the frontend's Cloud Run URL; check allow_credentials: true is only paired with that.
4. Auth flow — decode JWT (alg=RS256, aud=<projectId>, firebase.sign_in_second_factor=totp). Tamper to alg=none → 401. Swap X-Tenant-ID header → ignored (tenant from JWT).
5. Unauth surface — /api/ext/auth/check-allowlist and check-status require Bearer. Recurring HIGH check: ext_auth.py:_verify_blocking_function_token must pass audience=<backend URL> to google.oauth2.id_token.verify_token. If the audience= kwarg is still missing, file it again.
6. Cross-tenant IDOR (A vs B — separate practice schemas) — do this autonomously, do NOT ask the user to create accounts. The runner has already provisioned MFA-enrolled ephemeral users for you on this run: tokens in $PENTEST_TEST_ID_TOKEN_A / $PENTEST_TEST_ID_TOKEN_B (emails in $PENTEST_TEST_EMAIL_A / _B, uids in $PENTEST_TEST_UID_A / _B). A and B sit in separate pentest tenants — this exercises the Postgres schema boundary, which is the outermost layer of the patient-access model. Exercise both:
- With $PENTEST_TEST_ID_TOKEN_A, POST one patient + one session through the authenticated endpoints (name prefix PENTEST-<run-uuid>) so there is a known-real A-owned record. Capture the returned UUIDs. The session endpoint emits a note when the upload completes; capture the note id from the session response (or from GET /api/patients/<id>/notes) into $PENTEST_TEST_NOTE_ID_A.
- With $PENTEST_TEST_ID_TOKEN_B, run the probes below against A's UUIDs. Expect 404 from the repo layer (NOT 403 — 403 leaks existence). A 200 is a cross-tenant BOLA = CRITICAL + Breach candidate § 164.402. A 403 is HIGH (info leak via presence signal).
  - GET /api/patients/<A's-patient-uuid>
  - GET /api/sessions/<A's-session-uuid>
  - GET /api/notes/<A's-note-uuid> — the SOAP/note endpoint moved off /api/soap-notes/... in pa-0nx (notes/sessions split). Live shape is /api/notes/{note_id} (see backend/app/routes/notes.py).
  - PATCH /api/notes/<A's-note-uuid> with body {"content_edited": {"plan": "pentest-<run-uuid>"}} — write-side IDOR.
  - POST /api/notes/<A's-note-uuid>/finalize with body {"quality_rating": 5} — finalize-IDOR (would flip the note's finalized_at if it succeeded).
  - POST /api/notes/<A's-note-uuid>/submit-export (no body) — export-IDOR (would queue another tenant's PHI for outbound export).
  - GET /api/appointments/<A's-appt-uuid> (if A created one). Note: GET /api/appointments list-by-range stays on the calendar-owner shape (the "my calendar" view), so the list endpoint is not patient-scoped — only the single-resource read is. A 200 on the patient-scoped probe MAY be acceptable if B somehow has a grant on A's patient, but in this scenario B is in a different tenant entirely so even the schema lookup must miss; treat 200 here as CRITICAL.
- Repeat write-side probes: PUT/DELETE on the same foreign UUIDs with $PENTEST_TEST_ID_TOKEN_B → expect 404. A 200/204 means a write IDOR — CRITICAL + Breach candidate.
- Probe X-Tenant-ID header swap on the same endpoints — the header must be ignored (tenant is taken from JWT).
Fallback — if A/B tokens are missing (identity bootstrap failed): note explicitly in §12 that "cross-tenant IDOR was not exercised this run — identity bootstrap returned no credentials". Do NOT fabricate results and do NOT attempt to mint new Firebase accounts inline — the runner owns that lifecycle.
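The expected-status rubric for the foreign-UUID probes above can be sketched as a classifier. The 404, 403, and 200/204 mappings come from the text; routing any other status to manual review is an assumption of this sketch, and the function name is hypothetical.

```python
def classify_idor_probe(status: int) -> tuple[str, str]:
    """Map an HTTP status from a foreign-UUID probe to (severity, reason)."""
    if status == 404:
        return ("PASS", "repo layer hid the record without leaking existence")
    if status in (200, 201, 204):
        return ("CRITICAL", "cross-boundary access succeeded; Breach candidate § 164.402")
    if status == 403:
        return ("HIGH", "presence signal leaks that the record exists")
    # Assumption: 401, 429, 5xx and the rest need a human look before scoring.
    return ("REVIEW", f"unexpected status {status}; inspect manually")
```

The same mapping applies to the read probes and to the PUT/PATCH/DELETE write probes, since a success on either side is scored CRITICAL.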

6b. Same-tenant cross-clinician IDOR (A vs C — same schema, no patient_clinicians grant) — PR #170 unified every patient-scoped table (patients, notes, therapy_sessions, appointments) around the has_patient_access(patient_id, user_id) SQL function backed by the patient_clinicians table (see migrations 777b846ab944_patient_clinicians_table_and_access_* and 9dea1edf7fe0_drop_patients_user_id_in_favor_of_* in backend/alembic/versions/). The schema boundary doesn't help here — A and C share a Postgres schema, but C has no grant row on A's patient. This is the boundary the #170 IDOR was actually on; the cross-tenant block above is the outer-defense check, this is the load-bearing one. The runner provisions a third user for this scenario: token in $PENTEST_TEST_ID_TOKEN_C (email $PENTEST_TEST_EMAIL_C, uid $PENTEST_TEST_UID_C).

Reuse the same A-owned UUIDs from §6 ($PENTEST_TEST_NOTE_ID_A, A's patient + session UUIDs). Do not re-create resources for this block.
With $PENTEST_TEST_ID_TOKEN_C, run the same probe set as §6 against A's UUIDs:
- GET /api/patients/<A's-patient-uuid>
- GET /api/sessions/<A's-session-uuid>
- GET /api/notes/<A's-note-uuid>
- PATCH /api/notes/<A's-note-uuid> with body {"content_edited": {"plan": "pentest-<run-uuid>"}}
- POST /api/notes/<A's-note-uuid>/finalize with body {"quality_rating": 5}
- POST /api/notes/<A's-note-uuid>/submit-export (no body)
Expect 404 on every probe (consistent with the existence-non-leak rule). A 200 / 200-with-edit / 201-with-finalize / 200-with-queued-export is CRITICAL + Breach candidate § 164.402 — same severity tier as a cross-tenant breach, because the schema isolation did not save us and has_patient_access is the only remaining wall. Phrase the §6 entry as "same-tenant cross-clinician BOLA via has_patient_access bypass" and link back to PR #170 as the introducing change.
Skip /api/appointments/<A's-appt-uuid> for the §6b probe set: appointments deliberately keep a calendar-owner read shape on top of the patient-scoped grant (the "my calendar" view), so a same-tenant clinician may see the appointment summary if they hold any patient grant. That won't happen in this scenario (C has zero grants), but the patient probe is sufficient evidence on its own and the appointment probe is noisy.

Fallback — if $PENTEST_TEST_ID_TOKEN_C is missing but A/B are present (the bootstrap is mid-rollout, or this skill is being driven against an older runner that pre-dates _C): note explicitly in §12 that "same-tenant cross-clinician IDOR was not exercised this run — _C credentials absent". Do NOT silently fold this back into the §6 cross-tenant numbers — they're different scenarios with different defense-in-depth assertions.
7. Privilege escalation — /api/admin/* with clinician token → ADMIN_REQUIRED. JWT tamper (won't work, Firebase verifies sig).
8. Injection (targeted) — 1–2 probes each on reflected XSS, SQLi (boolean+time on obvious params), SSRF on iCal feed_url (hostname allowlist + scheme check; don't bother with 169.254.169.254 — it's blocked by hostname).
9. Rate limiting — 5 bad passwords on the real account via Firebase signInWithPassword (Firebase handles lockout; don't exceed 10). Skip if prior run already tripped lockout.
10. Upload DoS — Content-Length: 2000000000 + tiny body header-only probe (reject on header = good). Recurring MEDIUM check: sessions.py:upload_audio should read in bounded chunks, not await file.read() before size check.
11. MFA integrity — POST /api/users/me/mfa-enrolled as a fresh non-MFA sign-up. Recurring MEDIUM: returns a new mfa_enrolled_at without verifying Firebase-side enrollment. Doesn't grant access (JWT claim still gates), but poisons compliance metrics.
12. Signup hygiene — identitytoolkit /accounts:signUp with @example.invalid should fail in a locked-down deployment. If it succeeds, restrict_signups=false — flag it.
13. Cloud SQL direct — see section above. audit_logs table existence is the key HIPAA check. The Cloud SQL collector also asserts two integrity properties on audit_logs: the app role holds no UPDATE/DELETE (append-only, §164.312(c)(1)) and no foreign key cascades a patient delete into audit rows (6-year retention, §164.530(j)). Findings surface in 30-cloud-sql.txt.
14. Cloud configuration depth — the runner pre-computes deterministic posture checks; read these artifacts and fold their findings into §6 / positive results into §6a / evidence into §7. Do not re-run the gcloud queries.
- 50-sa-iam.txt — Cloud Run runtime service-account roles. Target: no roles/owner or roles/editor, no unjustified *.admin, not the shared default compute SA. (§164.308(a)(4))
- 51-cloud-sql-posture.txt — Cloud SQL TLS mode, public-IP/private-IP exposure, CMEK custody. Target: TLS-only, no open authorized network, Private IP (or documented proxy-only). Google-managed keys are recorded as INFO (addressable). (§164.312(e), §164.312(a)(2)(iv))
- 52-audit-log-config.txt — Cloud Audit Logs DATA_READ for Secret Manager + Cloud SQL. (§164.312(b), operator side)
- 53-secret-rotation.txt — newest enabled version age per secret; flags any past the max-age threshold. (§164.308(a)(5)(ii)(D))
- 54-wif-scope.txt — Workload Identity Federation trust conditions. Target: repo-pinned (assertion.repository ==), not org-only. (supply-chain trust)
- 55-ci-workflow-audit.txt — pull_request_target workflows that check out untrusted PR head (present only when the .github tree is in scope).
- 56-image-provenance.txt — deployed image build-provenance presence (best-effort; skips cleanly when unavailable).
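For the alg=none tamper in the auth-flow checklist item, one way to build the tampered token from the run's own ID token is sketched below, using only the standard library. The helper names are hypothetical; the result is sent as the Bearer token to the app's own backend, and a 401 is the expected outcome.

```python
import base64
import json


def b64url_encode(raw: bytes) -> str:
    """Base64url without padding, as used in JWT segments."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(segment: str) -> bytes:
    """Decode a JWT segment, restoring the padding JWTs strip off."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def tamper_alg_none(token: str) -> str:
    """Rewrite the header to alg=none and drop the signature; payload is untouched."""
    header_b64, payload_b64, _signature = token.split(".")
    header = json.loads(b64url_decode(header_b64))
    header["alg"] = "none"
    new_header = b64url_encode(json.dumps(header, separators=(",", ":")).encode())
    return f"{new_header}.{payload_b64}."


# Synthetic token for illustration only; it carries no real claims or signature.
sample = ".".join([
    b64url_encode(b'{"alg":"RS256","typ":"JWT"}'),
    b64url_encode(b'{"sub":"pentest"}'),
    "sig",
])
tampered = tamper_alg_none(sample)
```

Anything other than a 401 on the tampered token means signature verification is being skipped, which would be scored under the auth-bypass rows of the severity rubric.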

Deliverable — HIPAA-grade report format

Target audience: HHS OCR auditors, Pablo's owner/operator, and an external qualified pentester using this report as their scoping input. Every run produces the full artifact — there is no "lite" mode.

Never skip a finding. Every issue observed — INFO through CRITICAL — lands in §6 with a full write-up. Severity is how findings are ordered and highlighted, not a gate for inclusion. A "clean" run still enumerates the INFO-level observations it considered. §2 highlights CRITICAL/HIGH for the reader; nothing is dropped because it looked minor. If you considered an issue and decided it was a non-issue, that belongs in §6a (Positive controls and items tested clean) with the reasoning, not silently omitted.

Positive findings are first-class. §6a is mandatory — it captures the controls that passed and the things the assessor considered and dismissed, with the same evidence rigor as a finding (what was tested, what the expected behavior was, what was observed). This is what a Covered Entity shows an auditor to demonstrate that the technical safeguards are actually working, not just written down, and that the assessor looked at the full surface.

Put this disclaimer on every report cover: this is an automated self-assessment driven by an LLM, not an independent qualified third-party pentest. For full HIPAA §164.308(a)(8) defensibility (2024 NPRM anticipated to finalize in 2026), Pablo should still engage an independent qualified pentester at least annually. This weekly artifact complements — does not replace — that engagement, and is meant to surface issues between formal engagements and give the external tester a scoped starting point.

Report sections, in order (use these exact headings):

1. Cover & scope

Covered entity (Pablo Health, LLC); report ID (PABLO-PENTEST-<date>-<run UUID>); reporting period (since prior run in GCS); tester identity (CLI + model); authorization statement; in-scope systems (Cloud Run × 2, Cloud SQL, Firebase, GCS compliance bucket, Secret Manager, backend source); out-of-scope (GCP/Firebase/Vertex infra, third-party source repos, physical, social engineering, wireless); run metadata (project ID, URLs, connection name, region).

2. Executive summary

3–5 bullets, business-risk language. Lead with severity totals and trend vs prior run.

3. Asset inventory & data flow

Observed external egress destinations — enumerate from code + config, then reason about each destination's BAA coverage. List:

Deployed service env vars that route inference: gcloud run services describe pablo-backend --region=us-central1 --format='value(spec.template.spec.containers[0].env)' → record CLAUDE_CODE_USE_VERTEX, GOOGLE_GENAI_USE_VERTEXAI, ANTHROPIC_VERTEX_PROJECT_ID, GOOGLE_CLOUD_LOCATION values.

All external hostnames the backend can reach:

grep -rEoh "https?://[a-zA-Z0-9.-]+" /app/backend | sort -u

For each destination the backend can send request bodies (prompts, transcripts, patient fields) to, determine whether it is covered by a Business Associate Agreement — Vertex AI under the project's Google Cloud BAA is covered; api.anthropic.com, api.openai.com, generativelanguage.googleapis.com (public Gemini direct) are not covered by the Google Cloud BAA. Apply the severity rubric below: ePHI traversing a non-BAA destination is a § 164.504(e) permitted-use violation and a § 164.402 Breach candidate — score per the rubric, do not pre-commit to a specific tier here.
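The hostname sweep and BAA check can be sketched together. The regex mirrors the grep above, and the non-covered set is limited to the three hosts this section names; the function name is hypothetical, and a host missing from the set is not thereby proven covered.

```python
import re

# Destinations this section lists as NOT covered by the Google Cloud BAA.
NON_BAA_HOSTS = {
    "api.anthropic.com",
    "api.openai.com",
    "generativelanguage.googleapis.com",
}

URL_HOST_RE = re.compile(r"https?://([a-zA-Z0-9.-]+)")


def non_baa_destinations(source_text: str) -> list[str]:
    """Return known non-BAA hostnames referenced anywhere in the given text."""
    # rstrip handles a URL that ends a sentence, e.g. "https://api.openai.com."
    hosts = {h.lower().rstrip(".") for h in URL_HOST_RE.findall(source_text)}
    return sorted(hosts & NON_BAA_HOSTS)


sample_source = '''
VERTEX = "https://us-central1-aiplatform.googleapis.com"
FALLBACK = "https://api.anthropic.com/v1/messages"
'''
print(non_baa_destinations(sample_source))
```

A hit only shows the hostname is referenced in code; whether request bodies carrying ePHI actually reach it still has to be confirmed against the deployed env vars before scoring.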

4. Threat model (STRIDE-lite, Pablo-specific)

Actors: unauth internet; authenticated clinician cross-tenant; authenticated clinician horizontal BOLA within tenant; insider with GCP IAM; compromised dependency; compromised subprocessor. Crown jewels: patient records, session transcripts, audit_logs, Firebase/GCP credentials, AUTH_SECRET/JWT_SECRET_KEY. Map each actor to attack paths and to the findings in §6.

5. Methodology & frameworks

OWASP WSTG v4.2, OWASP API Security Top 10 (2023), OWASP ASVS 4.0 Level 2, PTES Technical Guidelines, NIST SP 800-115, HIPAA Security Rule §164.308/.310/.312/.314 (2024 NPRM). List each tool from the scanner tooling section that was invoked, together with the flags it ran with.

6. Findings

Severity rubric — HIPAA overlay applies. Raw CVSS underweights PHI impact: a "medium" CVSS CVE that enables PHI access is a § 164.402 Breach and ships as CRITICAL. Use:

effective_severity = max(cvss_tier, phi_impact_tier)

where phi_impact_tier is:

CRITICAL — any unauthenticated PHI read; cross-tenant or horizontal PHI access by an authenticated user; PHI integrity compromise (write/delete across tenant boundary); PHI egress to infrastructure not covered by a Business Associate Agreement (§ 164.504(e) — the disclosure path itself is an impermissible use regardless of whether a Breach has occurred yet); any condition that would meet the § 164.402 "acquisition, access, use, or disclosure of PHI in a manner not permitted" definition of a reportable Breach.
HIGH — authenticated privilege escalation; auth-bypass that could lead to PHI without a second bug; PHI availability loss > 24h; any secret in git (gitleaks hit); missing/non-firing audit logs on PHI routes (§ 164.312(b) gap).
MEDIUM — info disclosure without direct PHI nexus; DoS vector; missing defense-in-depth on a PHI-adjacent path.
LOW — defense-in-depth gap with no PHI nexus.
INFO — observation / posture note.
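The overlay formula is an ordered-tier maximum. A minimal sketch, where the tier ordering (with NONE as the floor for findings with no PHI nexus) and the function name are assumptions of this example:

```python
# Lowest to highest; index position is the rank used for comparison.
TIERS = ["NONE", "INFO", "LOW", "MEDIUM", "HIGH", "CRITICAL"]


def effective_severity(cvss_tier: str, phi_impact_tier: str) -> str:
    """effective_severity = max(cvss_tier, phi_impact_tier) over the ordered tiers."""
    return max(cvss_tier.upper(), phi_impact_tier.upper(), key=TIERS.index)
```

For example, a High-CVSS cross-tenant read with a Critical PHI tier ships as CRITICAL, while a Low-CVSS missing header with no PHI nexus stays LOW.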

"PHI-adjacent" means the path touches metadata, signals, or fields that would appear in a HIPAA compliance report, audit trail, or Breach risk-assessment, even when the underlying PHI is not itself exposed. Example: /api/users/me/mfa-enrolled poisons the mfa_enrolled_at field that feeds § 164.308(a)(5) workforce-MFA attestations — PHI-adjacent, MEDIUM. Counter-example: missing HSTS on a static-asset CDN with no session tokens — no PHI nexus, LOW.

Baseline examples (the raw CVSS column is what a pure-technical scorer would emit; effective is what ships):

| Finding | CVSS tier | PHI tier | Effective | Note |
|---|---|---|---|---|
| Unauth `GET /api/patients/<id>` returns 200 | High | Critical | Critical | § 164.402 Breach (reportable) |
| Authed clinician B reads tenant A's patient | High | Critical | Critical | Cross-tenant BOLA = Breach |
| Stored XSS in session notes | Medium | Critical | Critical | PHI integrity + exfiltration path |
| `audit_logs` silent gap > 24h on PHI route, closed-loop ALSO failed | Medium | High | High | § 164.312(b) enforcement failure |
| `audit_logs` empty over 24h, closed-loop GREEN this run | Info | None | Info | Idle-period evidence, not a finding (see §Cloud SQL guidance) |
| Stale npm dep, CVSS 7.5, no PHI reachability | High | None | High | No overlay needed |
| HSTS header missing | Low | None | Low | Defense-in-depth only |
| Backend routes inference to `api.anthropic.com` (not Vertex); transcripts contain PHI | Low (config) | Critical | Critical | PHI disclosed to non-BAA subprocessor = § 164.504(e) impermissible use + § 164.402 Breach candidate; tag `Breach candidate: § 164.402` |

Compute and emit a CVSS 3.1 base vector for every finding (e.g., CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:H/A:N) alongside the effective HIPAA-overlay severity, so auditors can see both scores and the reasoning.
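
Before emitting a vector, it can help to check that it is well formed. The sketch below validates the eight CVSS 3.1 base metrics and accepts the vector with or without the `CVSS:3.1/` prefix. `parse_cvss31_base` is a hypothetical helper; it covers base metrics only and rejects temporal or environmental metrics.

```python
# Allowed values for each CVSS 3.1 base metric.
BASE_METRICS = {
    "AV": {"N", "A", "L", "P"},
    "AC": {"L", "H"},
    "PR": {"N", "L", "H"},
    "UI": {"N", "R"},
    "S": {"U", "C"},
    "C": {"H", "L", "N"},
    "I": {"H", "L", "N"},
    "A": {"H", "L", "N"},
}


def parse_cvss31_base(vector: str) -> dict[str, str]:
    """Return {metric: value}; raise ValueError on a malformed base vector."""
    body = vector.removeprefix("CVSS:3.1/")
    parsed: dict[str, str] = {}
    for part in body.split("/"):
        metric, sep, value = part.partition(":")
        if not sep or metric not in BASE_METRICS:
            raise ValueError(f"unknown metric in {part!r}")
        if value not in BASE_METRICS[metric]:
            raise ValueError(f"invalid value in {part!r}")
        if metric in parsed:
            raise ValueError(f"duplicate metric {metric}")
        parsed[metric] = value
    missing = BASE_METRICS.keys() - parsed.keys()
    if missing:
        raise ValueError(f"missing metrics: {sorted(missing)}")
    return parsed


print(parse_cvss31_base("AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:H/A:N")["S"])
```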

Breach-candidate flag. Any finding scored CRITICAL because of a PHI nexus (not because CVSS alone was Critical) must carry a Breach candidate: § 164.402 tag in its subsection header, with a sentence naming the specific disclosure path. This is what separates a HIPAA-aware report from a generic CVE dump, and it's the line an OCR reviewer looks for first.
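
One way to apply the flag mechanically, as a sketch: tag a finding only when the PHI tier is what makes it CRITICAL. The header layout and the `finding_header` helper are assumptions for illustration; the sentence naming the disclosure path still has to be written by hand.

```python
TIER_ORDER = ["NONE", "INFO", "LOW", "MEDIUM", "HIGH", "CRITICAL"]
BREACH_TAG = "Breach candidate: § 164.402"


def finding_header(finding_id: str, title: str, cvss_tier: str, phi_tier: str) -> str:
    """Build a subsection header; the layout is hypothetical."""
    effective = max(cvss_tier, phi_tier, key=TIER_ORDER.index)
    header = f"{finding_id}: {title} [{effective}]"
    # Tag only when CRITICAL comes from the PHI nexus, not from CVSS alone.
    if phi_tier == "CRITICAL" and cvss_tier != "CRITICAL":
        header += f" ({BREACH_TAG})"
    return header


print(finding_header("F-01", "Stored XSS in session notes", "MEDIUM", "CRITICAL"))
print(finding_header("F-02", "HSTS header missing", "LOW", "NONE"))
```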

Methodology reference: OWASP Risk Rating Methodology (business-impact multiplier), HITRUST CSF Risk Analysis Guide, HHS OCR "Guidance on Risk Analysis Requirements" under § 164.308(a)(1)(ii)(A). The max(cvss_tier, phi_impact_tier) formula is the common Coalfire/Clearwater pattern for HIPAA-aware pentest reports.

6a. Positive controls and items tested clean

Single evidence-rigor table — no parallel bullet list elsewhere. Columns: Control | What was tested | Expected behavior | Observed evidence | §164 control mapped. Draw from the checklist (§1–§13) — at minimum cover: TLS ≥1.2 enforcement, security headers (HSTS / CSP / X-Content-Type-Options), CORS origin allowlist, unauth 401 on PHI routes, JWT alg=none rejection, X-Tenant-ID header ignored, cross-tenant 404-not-403, admin-route ADMIN_REQUIRED enforcement, Cloud Run egress env vars pointing to Vertex (not public Gemini/Anthropic), audit_logs table present with recent rows, audit schema PHI-free, CI supply-chain scans passing on the deployed tag. Every row cites specific evidence (status code, header value, row count, screenshot-equivalent text output).

Also include controls the model considered and dismissed as non-issues — e.g. "/api/ext/auth/seed-admin returns 404 as expected, not a live backdoor despite setup-solo.sh reference." This is what shows the auditor the technical safeguards are working and the assessor looked at the full surface, not merely what was written down. Dismissed scanner false positives belong in §8 / §9 with the tool context, not here.

7. HIPAA Security Rule control matrix

Copy the table under HIPAA control matrix below into this section, filling Status / Evidence / Gap from this run's observations. Every row is required; N/S rows need a one-line justification. Every Partial or Fail must link to a finding in §6.

8. Static analysis findings (semgrep)

Only confirmed-real hits, grouped by rule ID with file:line. Dismissed false positives stay here as a Dismissed: subsection with the reason — not duplicated in §6a or Appendix B.

9. Dependency & supply-chain scan

CI already runs pip-audit, trivy, and npm audit on every PR + weekly (see .github/workflows/ci.yml + security.yml). The pentest adds what CI can't: (a) scanning what's currently deployed (may lag main), (b) self-contained evidence inline in the report (auditors shouldn't need to cross-reference the GitHub Security tab). Flag any divergence between "CI clean" and "deployed tag vulnerable" as its own finding.

Subsections:

Python deps — pip-audit: Package | Installed | Vulnerable | Fixed | CVE | CVSS
Deployed container image — trivy image <deployed tag> where the tag comes from gcloud run services describe pablo-backend --region=us-central1 --format='value(spec.template.spec.containers[0].image)'
Multi-ecosystem — osv-scanner --recursive /app/backend
Secrets in git: gitleaks detect --source /app/backend --no-git (any hit is a §6 finding, HIGH at minimum per the severity rubric)
Documented exceptions — table of every entry under ## Open in docs/pentest/VULNERABILITY_EXCEPTIONS.md. Columns: Advisory | Package | Severity | Status | Raised | Age (days) | Revisit by | Stale (>30d)?. Note the file path actually used (or "file not found" with the paths searched). For each row whose advisory ID also appears in pip-audit / trivy / osv-scanner output above, add a Cross-ref: line under the row pointing at that scanner's finding ID — and downgrade the §6 entry for that advisory to INFO with Status: documented-exception. Stale rows (Age > 30) get a separate HIGH §6 finding per the staleness rule.
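
The age and staleness columns, plus the downgrade rule, can be sketched as below. The `ExceptionRow` fields, the `stale-exception` status string, and the sample advisory IDs are assumptions for illustration; only the 30-day threshold, the INFO downgrade, and the separate HIGH finding come from the rules above.

```python
from dataclasses import dataclass
from datetime import date

STALE_AFTER_DAYS = 30


@dataclass
class ExceptionRow:
    advisory: str
    package: str
    raised: date


def age_days(row: ExceptionRow, today: date) -> int:
    return (today - row.raised).days


def section6_entries(rows, scanner_advisories, today):
    """Return (advisory, severity, status) tuples destined for §6."""
    entries = []
    for row in rows:
        if row.advisory in scanner_advisories:
            # Documented exception: the scanner hit is downgraded to INFO.
            entries.append((row.advisory, "INFO", "documented-exception"))
        if age_days(row, today) > STALE_AFTER_DAYS:
            # Stale exception: separate HIGH finding per the staleness rule.
            entries.append((row.advisory, "HIGH", "stale-exception"))
    return entries


rows = [
    ExceptionRow("EXAMPLE-0001", "libfoo", date(2024, 1, 1)),
    ExceptionRow("EXAMPLE-0002", "libbar", date(2024, 2, 1)),
]
print(section6_entries(rows, {"EXAMPLE-0001", "EXAMPLE-0002"}, date(2024, 2, 15)))
```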

10. Prior-run carry-over & trends

Fetch the newest .md report dated before today from gs://<COMPLIANCE_REPORT_BUCKET>/pentest/. Diff it against this run into four buckets: Resolved since last run / Persisting (with a "consecutive runs open" counter) / Regressions (elevate to HIGH minimum) / New this run. If no prior report exists, mark this run as the baseline.
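
A sketch of the four-way diff using set arithmetic. It assumes the prior report can be reduced to three inputs (open finding IDs, resolved finding IDs, and a per-finding "consecutive runs open" counter) and treats a regression as a previously resolved ID that is open again; both are assumptions about the report format, and `diff_runs` is a hypothetical helper.

```python
def diff_runs(prior_open, prior_resolved, current_open, prior_streaks):
    """Bucket finding IDs for the carry-over section.

    prior_streaks maps finding ID -> consecutive runs open as of the prior report.
    """
    persisting = {
        fid: prior_streaks.get(fid, 1) + 1
        for fid in sorted(prior_open & current_open)
    }
    return {
        "resolved": sorted(prior_open - current_open),
        "persisting": persisting,
        "regressions": sorted(current_open & prior_resolved),  # HIGH minimum
        "new": sorted(current_open - prior_open - prior_resolved),
    }


result = diff_runs(
    prior_open={"F-1", "F-2"},
    prior_resolved={"F-0"},
    current_open={"F-0", "F-2", "F-3"},
    prior_streaks={"F-2": 3},
)
print(result)
```

For a baseline run, passing empty sets for the prior inputs puts every current finding in the "new" bucket.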

11. Endpoint coverage matrix

Group by auth profile, not by individual route. Columns: Auth profile | Tenant-scoped | Route count | Representative routes | Tests run | Result. One row per (auth_requirement, tenant_scope) combination — e.g. one row covering all require_mfa + tenant-scoped PHI routes, one for require_mfa + admin-only, one for get_current_user_no_mfa, one for public. List 2–3 representative routes per row; the exhaustive enumeration goes in Appendix B under endpoints.txt.

Any route that doesn't fit a known profile (missing Depends(), explicit public=true tag, unusual composition) gets its own row. Every unusual row needs either a positive test result or an explicit "deferred to external pentester" reason — never "ran out of time."
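
Grouping by profile is a plain group-by on (auth_requirement, tenant_scope). The sketch below assumes each route has already been annotated with its auth dependency and tenant scope; how that annotation is derived from the source is out of scope here, and the sample routes and `coverage_rows` helper are illustrative.

```python
from collections import defaultdict


def coverage_rows(routes, max_examples=3):
    """routes: iterable of (path, auth_profile, tenant_scoped) tuples."""
    groups = defaultdict(list)
    for path, auth_profile, tenant_scoped in routes:
        groups[(auth_profile, tenant_scoped)].append(path)
    return [
        {
            "auth_profile": profile,
            "tenant_scoped": scoped,
            "route_count": len(paths),
            "representative": sorted(paths)[:max_examples],
        }
        for (profile, scoped), paths in sorted(groups.items())
    ]


sample_routes = [
    ("/api/patients", "require_mfa", True),
    ("/api/sessions", "require_mfa", True),
    ("/api/config", "public", False),
]
for row in coverage_rows(sample_routes):
    print(row)
```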

Enumerate the full set for the appendix:

grep -rnE "@router\.(get|post|put|delete|patch)\(" /app/backend/app/routes/ > /tmp/endpoints.txt

12. Automated assessment scope boundaries

Describe what lies outside the boundary of this automated engagement and why, as scope constraints rather than gaps. The audience is auditors deciding what human-led testing should cover next, so word each entry as "requires human judgment / multi-session context / physical access / social engineering" rather than "we didn't test this." Entries belong here when they fall outside what any automated tool can assert: business-logic depth, multi-step stateful workflows, novel zero-days, prompt-injection depth, physical controls, social engineering. Do not list items that were simply skipped due to time; those are findings or fallback notes in the relevant section.

Static analysis exclusions (semgrep): The following paths are excluded from semgrep via --exclude flags (rationale mirrored in /.semgrepignore). A human reviewer or independent pentester should audit these paths directly:

| Excluded path | Rule(s) suppressed | Rationale |
|---|---|---|
| `alembic/` | `avoid-sqlalchemy-text` | Migration DDL; `text()` calls are operator-authored SQL, never reachable from web requests |
| `tests_integration/`, `tests/` | `avoid-sqlalchemy-text` | Test fixture DDL; not on the production attack surface |
| `app/db/__init__.py`, `app/db/provisioning.py` | `avoid-sqlalchemy-text` | Tenant schema management DDL (`CREATE SCHEMA`, `SET search_path`, `pg_advisory_lock`, RLS policy setup); all `text()` arguments are system-generated, never from user input |
| `app/jobs/pentest_*.py` | `dynamic-urllib-use-detected` | Self-assessment tooling; dynamic outbound calls target IAM-gated GCP admin APIs and operator-configured webhook URLs |
| `app/jobs/hipaa_log_review.py` | `dynamic-urllib-use-detected` | Audit log reviewer; urllib targets an operator-configured webhook (admin-only runtime config) |

Inline # nosemgrep annotations suppress individual false-positive hits where exclusion would be too broad (e.g. auth/service.py unverified-JWT routing helper, logger calls in auth handlers).

13. Prioritized remediation roadmap

Ordered list grouped by severity. Columns: Finding ID | Severity | Effort (S/M/L) | Target date | Owner | Retest by. The Retest by column defaults to "next scheduled run" — only override for CRITICAL items that need sooner re-verification (e.g. "next run + manual curl confirmation within 7 days"). The §15 retest-plan section was merged here; if a finding needs a bespoke retest procedure beyond "run this skill again," describe it inline in that finding's §6 subsection under a Retest procedure: line.

14. Appendices

The pentest_runner.py wrapper uploads every raw scanner artifact to gs://<COMPLIANCE_REPORT_BUCKET>/pentest/<run-uuid>/raw/ with retention lock. The report inlines findings-level output (the specific lines that drove a conclusion) and links to the GCS object for full raw dumps. Inline blocks are capped at ~50 lines each — anything longer, link out.
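
The inline-or-link rule can be sketched as a small helper. `render_evidence` and the placeholder URL are illustrative; the real object path comes from the runner.

```python
INLINE_CAP_LINES = 50  # approximate cap from the rule above


def render_evidence(text: str, raw_url: str) -> str:
    """Inline short excerpts; link out for anything over the cap."""
    line_count = len(text.splitlines())
    if line_count <= INLINE_CAP_LINES:
        return text
    return f"Output is {line_count} lines; raw output: {raw_url}"


short = "HTTP/2 401\nwww-authenticate: Bearer"
long_dump = "\n".join(f"line {i}" for i in range(200))
print(render_evidence(short, "gs://.../raw/<tool>.txt"))
print(render_evidence(long_dump, "gs://.../raw/<tool>.txt"))
```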

A: Commands executed — chronological list of every shell command/API call the run made (redact tokens and any PHI). One command per line. Audit defense: an OCR reviewer must be able to reconstruct what actually ran. This one stays fully inline — it's short and load-bearing.
B: Scanner invocations & findings. One subsection per tool (nuclei, ffuf, semgrep, pip-audit, trivy, osv-scanner, gitleaks, testssl.sh, nikto/sqlmap if invoked, plus the endpoints.txt enumeration from §11). For each: exact invocation, exit code, summary line counts (total / high / medium / low / info), inline excerpts of the specific lines that became §6 findings, and a Raw output: gs://.../raw/<tool>.txt link. Dismissed false positives for tools not covered by §8 or §9 live here with their dismissal reason; dismissals already recorded in §8 / §9 stay there. Do not duplicate any of them in §6a.
C: Cloud SQL query log — every SELECT run against the DB and the row counts returned (values redacted). Stays fully inline.
D: SBOM — link to gs://.../raw/sbom.cyclonedx.json; inline a one-line summary (component count, critical CVE count). If the image scan is unavailable, note that and link to pip-list.txt instead.
E: Attestation block — tester identity (CLI + model), run UUID, ISO timestamp, SHA256 of the report body (excluding this block). Unsigned (automated run); needs human countersign before the operator uses it as input to the annual §164.314(a) written verification to Covered Entities.
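
A sketch of the attestation block, hashing the report body before the block itself is appended so the digest excludes it. The field labels and the `attestation_block` helper are assumptions; only the listed contents (tester identity, run UUID, ISO timestamp, SHA256) come from the description above.

```python
import hashlib
from datetime import datetime, timezone


def attestation_block(report_body: str, tester: str, run_uuid: str) -> str:
    """Build appendix E; the hash covers report_body only, not this block."""
    digest = hashlib.sha256(report_body.encode("utf-8")).hexdigest()
    timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return "\n".join([
        f"Tester: {tester}",
        f"Run UUID: {run_uuid}",
        f"Timestamp: {timestamp}",
        f"Report SHA256: {digest}",
        "Signature: unsigned (automated run); human countersign required",
    ])


body = "# Pentest report\n...sections 1 through 14D...\n"
print(body + attestation_block(body, "cli + model", "00000000-0000-0000-0000-000000000000"))
```

A reviewer can re-verify by stripping the block and hashing what remains.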

Output handling: emit the complete markdown report — through all appendices — to stdout. Do NOT gsutil cp or otherwise upload to GCS yourself; the calling runner (pentest_runner.py) captures stdout and uploads to the retention-locked compliance bucket. Uploading from inside the skill creates duplicate objects with inconsistent metadata. The final thing you emit must be the closing of appendix E; do not append a trailing "uploaded to gs://…" line.

Length & tone: body 2000–3000 words; appendices on top of that. Inline evidence for anything ≤50 lines; link to the GCS raw-artifact bucket for longer dumps (see §14). Audit-ready neutral tone. Every "pass" claim cites observable evidence.

HIPAA control matrix (copy into §7 of every report)

Administrative safeguards (§164.308)

| Control | Requirement | Status | Evidence | Gap |
|---|---|---|---|---|
| §164.308(a)(1)(ii)(A) | Risk analysis: accurate, thorough, documented; reviewed ≥12mo | | | |
| §164.308(a)(1)(ii)(B) | Risk management: reduce risks to reasonable level | | | |
| §164.308(a)(1)(ii)(D) | Information system activity review: logs, access reports, incident reports | | | |
| §164.308(a)(3)(ii)(A) | Workforce authorization / supervision | | | |
| §164.308(a)(3)(ii)(C) | Termination procedures: revoke access on departure | | | |
| §164.308(a)(4)(ii)(B) | Access authorization: granted per role | | | |
| §164.308(a)(4)(ii)(C) | Access establishment & modification | | | |
| §164.308(a)(5)(ii)(C) | Log-in monitoring: detect anomalies | | | |
| §164.308(a)(5)(ii)(D) | Password management (NPRM: MFA required) | | | |
| §164.308(a)(6)(ii) | Security incident response & reporting | | | |
| §164.308(a)(7)(ii)(A) | Data backup plan: tested | | | |
| §164.308(a)(7)(ii)(B) | Disaster recovery plan: restore ≤72h (NPRM) | | | |
| §164.308(a)(7)(ii)(D) | Contingency plan testing ≥12mo | | | |
| §164.308(a)(8) | Technical evaluation (this report): annual pentest + biannual vuln scan (NPRM) | | | |

Physical safeguards (§164.310) — out of scope for this automated scan (cloud-inherited or workforce-level). Operator tracks separately.

Technical safeguards (§164.312) — the heart of this pentest.

| Control | Requirement | Status | Evidence | Gap |
|---|---|---|---|---|
| §164.312(a)(2)(i) | Unique user identification | | | |
| §164.312(a)(2)(ii) | Emergency access procedure | | | |
| §164.312(a)(2)(iii) | Automatic logoff / session timeout | | | |
| §164.312(a)(2)(iv) | Encryption / decryption of ePHI at rest (NPRM: required) | | | |
| §164.312(b) | Audit controls: record & examine activity | | | |
| §164.312(c)(1) | Integrity: protect ePHI from improper alteration/destruction | | | |
| §164.312(c)(2) | Mechanism to authenticate ePHI (NPRM) | | | |
| §164.312(d) | Person or entity authentication (NPRM: MFA required) | | | |
| §164.312(e)(1) | Transmission security | | | |
| §164.312(e)(2)(i) | Integrity controls in transit | | | |
| §164.312(e)(2)(ii) | Encryption in transit (NPRM: required) | | | |

Organizational / BA contracts (§164.314) — paper controls, out of scope for this automated scan. Operator tracks separately.

Administrative (§164.308) non-technical rows: rows such as workforce authorization, termination procedures, and risk analysis documentation are paper controls. The scanner only fills rows where it has direct technical evidence (e.g. §164.308(a)(5)(ii)(D) MFA via JWT claim inspection, §164.308(a)(1)(ii)(D) via audit_logs freshness). Mark the rest N/S (out of automated scope); do not mark them Pass/Fail from assumption.

Status values: Pass / Partial / Fail / N/S (out of automated scope). Every Partial / Fail must link to a finding in §6. N/S rows do not require evidence — they're tracked outside this report.
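
These rules are mechanical enough to lint before the report ships. The sketch below assumes each matrix row is a dict with `control`, `status`, `evidence`, and `finding_ids` keys; that row shape and the `matrix_problems` helper are illustrative, not the runner's actual data model.

```python
VALID_STATUS = {"Pass", "Partial", "Fail", "N/S"}


def matrix_problems(rows):
    """Return human-readable problems; an empty list means the matrix is consistent."""
    problems = []
    for row in rows:
        control, status = row["control"], row["status"]
        if status not in VALID_STATUS:
            problems.append(f"{control}: unknown status {status!r}")
        elif status in {"Partial", "Fail"} and not row.get("finding_ids"):
            problems.append(f"{control}: {status} must link to a §6 finding")
        elif status == "Pass" and not row.get("evidence"):
            problems.append(f"{control}: Pass needs observable evidence")
        # N/S rows need no evidence; they are tracked outside the report.
    return problems


matrix = [
    {"control": "§164.312(b)", "status": "Partial", "evidence": "3 rows in 24h", "finding_ids": []},
    {"control": "§164.312(e)(1)", "status": "Pass", "evidence": "TLS 1.2+ only"},
    {"control": "§164.308(a)(3)(ii)(C)", "status": "N/S"},
]
print(matrix_problems(matrix))
```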

Known recurring findings (verify each run — patches may land between runs)

Migration target: any row whose re-test is a one-line grep belongs in make lint (as a custom semgrep rule or a tiny pytest) rather than in a weekly pentest. When a regression check drops into the pentest twice in a row and has a mechanical verifier, file a task to promote it to CI and remove the row here. The pentest should focus on things a static check can't catch (live config, cross-tenant behavior, deployed-image vs source drift).

| Findings seen in past runs | Re-test |
|---|---|
| `ext_auth.py` `verify_token` missing `audience=` | grep `ext_auth.py` for `verify_token(`; confirm `audience=` kwarg present |
| `AuditService()` instantiated without DB → `logger.info` only | grep `get_audit_service` and `_persist`; confirm a Postgres write path |
| `sessions.py:upload_audio` reads before size check | grep `upload_audio` for `await .*\.read()`; confirm chunked reads |
| `/api/users/me/mfa-enrolled` trusts client | grep `mfa-enrolled` for Firebase Admin MFA verification |
| `restrict_signups=false` on this deployment | `gcloud run services describe pablo-backend --region=us-central1` env vars + live `signUp` probe |
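
As an illustration of promoting a grep-style re-test to CI, the first row could become a small AST check that a pytest wraps. `calls_missing_kwarg` is a hypothetical helper and the sample source is invented; the real check would read `ext_auth.py` from the repository.

```python
import ast


def calls_missing_kwarg(source: str, func_name: str, kwarg: str) -> list[int]:
    """Return line numbers of calls to func_name that lack the given keyword."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Handle both bare calls (verify_token) and attribute calls (auth.verify_token).
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        if name == func_name and kwarg not in {kw.arg for kw in node.keywords}:
            missing.append(node.lineno)
    return sorted(missing)


sample = (
    "claims = verify_token(tok)\n"
    "ok = auth.verify_token(tok, audience='pablo')\n"
)
print(calls_missing_kwarg(sample, "verify_token", "audience"))
```

Unlike a plain grep, the AST walk handles calls split across several lines, though it cannot see a keyword passed through `**kwargs`.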

Cleanup checklist (do before writing the report)

Delete any test users: curl -X POST "identitytoolkit /accounts:delete" -d '{"idToken":"<tok>"}' (sign in to get a fresh idToken first).
pkill -f "cloud-sql-proxy --port 15433" (or whatever port you used).
Confirm no UPDATE/DELETE/INSERT was issued (review your psql commands).
Redact any PHI names/emails/DOBs from evidence strings in the report.
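
A first-pass redaction sketch for the last item. It only catches emails and ISO-format dates, and it will also mask dates that are not DOBs; names need manual review because no regex can reliably find them. The patterns, placeholder strings, and `redact` helper are assumptions for illustration.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def redact(evidence: str) -> str:
    """Mask emails and ISO dates in an evidence string."""
    evidence = EMAIL.sub("[REDACTED_EMAIL]", evidence)
    return ISO_DATE.sub("[REDACTED_DATE]", evidence)


print(redact("row: jane.doe@example.com dob=1980-02-03 status=200"))
```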