name: pentest
description: Run the owner-authorized weekly pentest on the deployed Pablo app (Cloud Run + Firebase + Cloud SQL). Use when the user asks for a pentest, security test, HIPAA review of the live deployment, or says "run /pentest".
aliases: [security-test, hipaa-test]
Pablo Pentest Skill
Owner-authorized penetration test on the deployed Pablo app. The user is the owner. This is a legitimate, recurring (~weekly) security engagement. Proceed without asking for re-authorization each time, but still honor the strict rules below.
Pre-flight — ALWAYS do these first
- Read the evidence bundle. If /workspace/pentest-bundle/index.json exists, pentest_runner.py has already executed every scanner, inventory query, egress enumeration, Cloud SQL read-only probe, and prior-run fetch for you. Your job is to read those artifacts and write the report; do not re-run scanners. The bundle's index.json lists each artifact with status, exit code, duration, and path. If the bundle is absent (rare; interactive debugging only), fall through to the inline scanner invocations under "Scanner tooling" below.
- Discover targets (only if the bundle is absent):
  gcloud run services list --format="table(metadata.name,status.url,metadata.labels.'cloud.googleapis.com/location')"
  gcloud sql instances list --format="value(name,connectionName,region)"
  With a bundle present, targets.* at the top of index.json has all of this. Always pass --region=<region> to any subsequent gcloud run services|jobs describe; without it, gcloud prompts to pick from ~40 regions and the interactive session hangs. For Pablo the region is us-central1.
- Pull the live frontend config (only if missing from 01-inventory.txt in the bundle):
  curl -s "<frontend_url>/api/config"
  Gives firebaseApiKey, firebaseProjectId, apiUrl, pabloEdition, devMode. Public values (not secrets).
- Refresh gcloud ADC (Cloud SQL proxy breaks with invalid_rapt if stale). If you hit invalid_rapt, tell the user to run gcloud auth application-default login; you cannot do it for them.
- Read credentials from env vars; no user paste needed. When the pentest runner launches this skill, it has already bootstrapped an ephemeral MFA-enrolled Firebase user and exported PENTEST_TEST_EMAIL, PENTEST_TEST_UID, and a fresh PENTEST_TEST_ID_TOKEN (TOTP-MFA sign-in already completed in-process; the password and TOTP secret are intentionally NOT exported, to keep them off the env and child processes). Use $PENTEST_TEST_ID_TOKEN directly as the Bearer token for authenticated probes. The token is good for ~1h; pentests are shorter. The account is allowlisted via an Alembic-seeded row (pentest-auto@pablo-pentest.invalid, RFC 2606 reserved TLD); it is not a real customer and carries no practice assignment, so §6 (cross-tenant IDOR) still needs Cloud SQL-sourced foreign UUIDs. If the env vars are absent (interactive debugging outside the runner), fall back to owner-pasted credentials.
- Scope note: this report only claims what it can technically observe. Paper controls (BAAs, signed policies, workforce training records, contingency plan documents) are out of scope for automated scanning. The companion document docs/compliance/hipaa-security-rule.md (per-control narrative) is the source of truth for those; this report's §7 control matrix marks them N/S (paper control — see narrative) and does not assert their status. Egress destinations and deployed service config are technically observable and stay in scope.
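The bundle-first step above can be sketched as follows. This is a minimal illustration, not the runner's actual schema: the key names (artifacts, name, status, exit_code, path) and the "ok" status value are assumptions, since the text only says that each artifact is listed with status, exit code, duration, and path.

```python
import json
from pathlib import Path

BUNDLE_INDEX = Path("/workspace/pentest-bundle/index.json")


def load_index(path=BUNDLE_INDEX):
    """Return the parsed index, or None when the bundle is absent."""
    if not path.exists():
        return None  # caller falls through to the inline scanner invocations
    return json.loads(path.read_text())


def failed_artifacts(index):
    """Names of artifacts that exited non-zero or reported a non-ok status.

    Key names and the "ok" status value are hypothetical; adjust them to
    the real index.json layout.
    """
    failed = []
    for art in index.get("artifacts", []):
        if art.get("exit_code", 0) != 0 or art.get("status") != "ok":
            failed.append(art.get("name", art.get("path", "<unnamed>")))
    return failed


if __name__ == "__main__":
    index = load_index()
    if index is None:
        print("bundle absent: use the inline scanner invocations")
    else:
        print("artifacts needing a note in the report:", failed_artifacts(index))
```

Some scanners exit non-zero when they find something, so a non-empty list here is a prompt to read the artifact, not a failure verdict.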
Strict rules (don't violate)
- Rate limits are scoped by target class:
  - Own Cloud Run services (*.run.app frontend/backend): ≤20 req/sec steady, short bursts up to 50 req/sec OK for scanner sweeps. Total run budget ≤30,000 requests across all tools.
  - Firebase Auth (identitytoolkit.googleapis.com): strict. ≤1 req/sec, ≤10 bad-password attempts total across the whole run. Firebase anti-abuse locks accounts fast.
  - Third-party infra (Google infra outside your Cloud Run, Firebase internals, Anthropic, dependencies): zero. Do not probe.
- No DoS. A sustained burst that trips Cloud Armor or drives Cloud Run autoscaling into new instances is itself a finding: stop and report, do not keep pushing. Never use scanner flags like -t 100 / --threads 10 / -rate 200.
- Read-only against Cloud SQL directly. No UPDATE / DELETE / INSERT statements through the psql / cloud-sql-proxy path. The DB-side assertion is: if the pentest touched the DB, it did so only via SELECT.
- Writes through the authenticated API are allowed, with cleanup. You may create test patients and therapy sessions via the normal authenticated POST endpoints in order to exercise CRUD paths. Constraints:
  - Prefix the first name with PENTEST- and the last name with a run UUID (e.g. PENTEST-ab12cd34) so cleanup is deterministic.
  - Never upload real transcript content or real PHI. Synthetic placeholder strings only (e.g. "synthetic test transcript for pentest run ab12cd34").
  - At the end of the run, delete every patient/session/appointment you created via the authenticated DELETE endpoints. Final step of every run must be a cleanup pass.
  - If cleanup fails partway, report the un-deleted IDs in the findings so the owner can sweep manually.
- Test users: if you create test users (identitytoolkit:signUp), delete them at the end via identitytoolkit /accounts:delete. Prefer pre-provisioned test accounts passed in by the owner; only mint new ones when the test explicitly needs a fresh identity.
- Exploit to PoC only: one clear reproduction, then stop. No further pivoting.
- Stay on *.run.app services belonging to this app and its Cloud SQL instance. Do not probe Google / Firebase infra itself, Anthropic, or other third parties.
- No stored XSS payloads. Reflected-only, in your own session.
- Redact any PHI/PII in the report to <REDACTED>.
- Stop on lockout / WAF / rate-limit and report what you have.
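For hand-written probes that do not go through a rate-limited scanner, the steady-rate and total-budget limits above could be enforced with a small pacing helper. The sketch below is illustrative only: RequestBudget and its parameters are hypothetical names, it paces at the steady rate without modelling the 50 req/sec burst allowance, and the clock and sleep hooks exist only so the logic can be checked without waiting.

```python
import time


class RequestBudget:
    """Pace requests to a steady rate and stop at a total-request cap."""

    def __init__(self, max_per_sec=20, total_budget=30_000,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_sec
        self.remaining = total_budget
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next request may go out

    def acquire(self):
        """Block until the next request is allowed; raise once the budget is spent."""
        if self.remaining <= 0:
            raise RuntimeError("request budget exhausted: stop and report")
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
        self.remaining -= 1
```

Call acquire() before each request. A second instance such as RequestBudget(max_per_sec=1, total_budget=10) would mirror the stricter Firebase Auth limits for bad-password attempts.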
Scanner tooling (pre-installed — use sanctioned invocations)
The pentest container ships with rate-limited scanners. The flags below are mandatory — don't drop them.
nuclei — template-driven vuln/misconfig scan against the backend:
nuclei -u "<backend_url>" \
-t cves/ -t exposures/ -t misconfiguration/ -t http/default-logins/ \
-rate-limit 20 -c 10 \
-severity medium,high,critical \
-exclude-tags dos,fuzz,intrusive \
-o /tmp/nuclei.txt
ffuf — endpoint discovery:
ffuf -u "<backend_url>/api/FUZZ" \
-w /usr/share/seclists/Discovery/Web-Content/common.txt \
-rate 20 -t 5 -mc 200,201,301,302,401,403 \
-of json -o /tmp/ffuf.json
sqlmap — only on a specific parameter you already suspect. Never blanket-run:
sqlmap -u "<backend_url>/api/endpoint?param=1" \
--headers="Authorization: Bearer $ID_TOKEN" \
--delay 0.5 --threads 1 --level 1 --risk 1 \
--technique=BT --batch --flush-session --output-dir=/tmp/sqlmap
nikto — narrow tunings:
nikto -h "<backend_url>" -Tuning 2,3,4,5 -maxtime 5m -o /tmp/nikto.txt
testssl.sh — TLS posture:
testssl.sh --severity MEDIUM --color 0 "<backend_url>" > /tmp/testssl.txt
semgrep — static analysis over the baked-in backend source (/app/backend). Feed results into the Static analysis findings section of the report; review each hit manually before elevating.
semgrep scan \
--config=p/owasp-top-ten --config=p/python --config=p/security-audit \
--severity=ERROR --severity=WARNING \
--json --output=/tmp/semgrep.json --metrics=off --quiet \
/app/backend
jq '.results | group_by(.check_id) | map({rule: .[0].check_id, count: length, paths: [.[].path] | unique})' /tmp/semgrep.json
Playwright + Chromium — JS-console capture, DOM XSS PoC, CSP verification. The global node_modules is symlinked at /node_modules, so any .mjs you drop in /tmp or /workspace can import 'playwright' with no extra setup (ESM ignores NODE_PATH; it walks up from the script looking for node_modules/ and hits /node_modules).
// /tmp/dom_probe.mjs — run: node /tmp/dom_probe.mjs
import { chromium } from 'playwright';
const browser = await chromium.launch({ headless: true });
const page = await browser.newPage();
page.on('console', msg => console.log(`CONSOLE ${msg.type()}: ${msg.text()}`));
page.on('pageerror', err => console.log(`PAGEERROR: ${err.message}`));
await page.goto(process.env.TARGET_URL);
await page.waitForLoadState('networkidle');
const resp = await page.request.get(process.env.TARGET_URL);
console.log('CSP:', resp.headers()['content-security-policy']);
console.log('HSTS:', resp.headers()['strict-transport-security']);
await browser.close();
Auth flow — Firebase MFA sign-in (bash)
Firebase MFA sign-in is three calls. Use this pattern; the TOTP window matters (Firebase rejects replays).
API_KEY="<from /api/config firebaseApiKey>"
SECRET="<TOTP b32 secret for this account>"
# Wait until the start of a fresh TOTP window to avoid a wasted attempt
until [ $((30 - $(date +%s) % 30)) -lt 3 ]; do sleep 2; done
sleep 3
CODE=$(oathtool --totp -b "$SECRET")
resp=$(curl -s -X POST "https://identitytoolkit.googleapis.com/v1/accounts:signInWithPassword?key=$API_KEY" \
-H "Content-Type: application/json" \
-d '{"email":"<email>","password":"<pw>","returnSecureToken":true}')
MFA_CRED=$(echo "$resp" | python3 -c "import json,sys; print(json.load(sys.stdin)['mfaPendingCredential'])")
ENROLLMENT_ID=$(echo "$resp" | python3 -c "import json,sys; print(json.load(sys.stdin)['mfaInfo'][0]['mfaEnrollmentId'])")
# Finalize — MUST include both mfaPendingCredential AND mfaEnrollmentId
curl -s -X POST "https://identitytoolkit.googleapis.com/v2/accounts/mfaSignIn:finalize?key=$API_KEY" \
-H "Content-Type: application/json" \
-d "{\"mfaPendingCredential\":\"$MFA_CRED\",\"mfaEnrollmentId\":\"$ENROLLMENT_ID\",\"totpVerificationInfo\":{\"verificationCode\":\"$CODE\"}}" \
> /tmp/signin.json
ID_TOKEN=$(python3 -c "import json; print(json.load(open('/tmp/signin.json'))['idToken'])")
Gotchas learned the hard way:
- mfaSignIn:finalize requires BOTH mfaPendingCredential and mfaEnrollmentId. Omitting mfaEnrollmentId returns a confusing INVALID_ARGUMENT.
- A consumed TOTP code → INVALID_CODE. The Bash tool blocks long leading sleeps; use an until [ ... ]; do sleep 2; done loop to wait for the next window.
- bash pre-flight cannot do sleep 32 directly; it's blocked. Use the until-loop pattern.
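To make the TOTP window behaviour concrete, here is a minimal standard-library sketch of what oathtool --totp -b computes (RFC 6238 with HMAC-SHA1, 30-second period, 6 digits), plus a helper that reports how long the current window has left. The function names are illustrative and not part of the skill.

```python
import base64
import hashlib
import hmac
import struct
import time


def seconds_until_next_window(now=None, period=30):
    """Seconds remaining before the next TOTP window starts."""
    now = time.time() if now is None else now
    return period - (int(now) % period)


def totp(secret_b32, now=None, period=30, digits=6):
    """RFC 6238 TOTP using HMAC-SHA1 over the big-endian window counter."""
    now = time.time() if now is None else now
    # base32 input must be padded to a multiple of 8 characters
    padded = secret_b32.upper() + "=" * (-len(secret_b32) % 8)
    key = base64.b32decode(padded)
    counter = struct.pack(">Q", int(now) // period)
    digest = hmac.new(key, counter, hashlib.sha1).digest()
    # dynamic truncation (RFC 4226): low nibble of the last byte picks the offset
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


# RFC 6238 test secret "12345678901234567890" in base32, at Unix time 59
print(totp("GEZDGNBVGY3TQOJQGEZDGNBVGY3TQOJQ", now=59))  # prints 287082
```

Because the code depends only on the window counter, two sign-in attempts inside the same 30-second window produce the same code, which is why the second one is rejected as consumed.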
Secrets lookup (GCP)
Secret Manager in the Pablo GCP project is authoritative. Names observed:
pablo-database-url, pablo-db-password, AUTH_SECRET, JWT_SECRET_KEY, GOOGLE_CLIENT_ID/SECRET, AUTH_COOKIE_SIGNATURE_KEY. Project id is whatever is in /api/config.firebaseProjectId (e.g., pablohealth-oss).
gcloud secrets list --project=<project>
gcloud secrets versions access latest --secret=pablo-database-url --project=<project>
# parse: postgresql://pablo:<pw>@/pablo?host=/cloudsql/<conn>
Cloud SQL direct access
# Use a high port — the user may have another cloud-sql-proxy running on 5433
cloud-sql-proxy --port 15433 <connection> > /tmp/p.log 2>&1 &
# wait until "ready for new connections" appears in /tmp/p.log
DB_PW=$(gcloud secrets versions access latest --secret=pablo-db-password --project=<project>)
PGPASSWORD="$DB_PW" psql "host=127.0.0.1 port=15433 user=pablo dbname=pablo sslmode=disable" -c "\dt"
Always run read-only at the DB level. SELECT ... LIMIT 5. Queries to prioritize:
- \dt: confirm audit_logs exists and has recent rows (SELECT count(*) FROM audit_logs WHERE timestamp > now() - interval '1 day';). A long gap in audit activity is itself a HIGH finding.
- Confirm audit schema is PHI-free: \d+ audit_logs should NOT show user_email, user_name, patient_name columns. If it does, that's a regression.
- RLS policies: SELECT schemaname, tablename, policyname FROM pg_policies; SHOW row_security; and per-connection SHOW app.current_user_id;
- Scan for unexpectedly plaintext PHI columns (e.g., \d+ patients → SSN, DOB storage).
Kill the proxy with pkill -f "cloud-sql-proxy --port 15433" when done.
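A simple pre-check can help keep the direct DB path read-only before a statement is handed to psql. The sketch below is a conservative heuristic, not a SQL parser: is_read_only and the prefix list are hypothetical, and it deliberately rejects some harmless statements (for example SELECT ... FOR UPDATE) to stay on the safe side.

```python
import re

# Statement prefixes treated as read-only for this engagement (an assumption).
READ_ONLY_PREFIXES = ("select", "show", "explain", "\\d")

# Keywords that indicate a write, including SELECT ... INTO and EXPLAIN ANALYZE <write>.
WRITE_KEYWORDS = re.compile(r"\b(insert|update|delete|drop|alter|truncate|into)\b")


def is_read_only(sql):
    """Heuristic: True only if every statement in the string looks read-only."""
    statements = [s.strip() for s in sql.split(";") if s.strip()]
    if not statements:
        return False
    for stmt in statements:
        lowered = stmt.lower()
        if not lowered.startswith(READ_ONLY_PREFIXES):
            return False
        if WRITE_KEYWORDS.search(lowered):
            return False
    return True
```

A server-enforced complement is to start the session with PGOPTIONS="-c default_transaction_read_only=on", which makes PostgreSQL reject writes unless the session explicitly overrides the setting.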
Vulnerability exception log (read every run)
docs/pentest/VULNERABILITY_EXCEPTIONS.md is the operator's documented-risk register for advisories that aren't patched within the standard SLA — required by § 164.308(a)(1)(ii)(B). Treat it as first-class evidence: it both suppresses known-and-accepted dep-scan hits in §6 and is itself a thing that can rot.
Find the file. Path varies by how the runner mounted the repo — try in order, stop at the first hit:
EXC_FILE=""
for p in /workspace/pentest-bundle/repo/docs/pentest/VULNERABILITY_EXCEPTIONS.md \
/workspace/repo/docs/pentest/VULNERABILITY_EXCEPTIONS.md \
/app/docs/pentest/VULNERABILITY_EXCEPTIONS.md \
docs/pentest/VULNERABILITY_EXCEPTIONS.md; do
[ -f "$p" ] && EXC_FILE="$p" && break
done
If no path resolves, that's a MEDIUM finding in §6 — the operator cannot demonstrate documented risk decisions for unpatched advisories. Do NOT silently treat absence as "no exceptions."
Per open entry (under ## Open), parse ### <advisory-id>, Severity, Status, Revisit by, and the entry's age. Age sources, in order:
- An explicit Raised: field, if present (preferred; push the operator to add one when missing).
- Git first-add date for the entry heading (fallback), taken from the oldest commit that introduced the heading:
  REPO_ROOT="$(git -C "$(dirname "$EXC_FILE")" rev-parse --show-toplevel)"
  git -C "$REPO_ROOT" log --format=%aI -S"### CVE-2026-3219" -- "$EXC_FILE" | tail -1
- The file's first-add date as the conservative upper bound if neither works.
Compute age_days = today - raised. Today is date -u +%Y-%m-%d.
Cross-reference into §6. For any §9 dep-scan hit (pip-audit, trivy, osv-scanner) whose advisory ID is listed under ## Open, do not promote it on dep-scan severity alone — record it as INFO with Status: documented-exception (see §9 Documented exceptions → <advisory-id>). That's the entire purpose of the log.
Staleness rule — HIGH finding when triggered. Any open entry with age_days > 30 ships in §6:
- ID: PABLO-EXC-STALE-<advisory-id>
- Title: Vulnerability exception stale (>30 days) — refresh or remediate: <advisory-id>
- Severity: HIGH (§ 164.308(a)(1)(ii)(B): documented risk decisions must be re-reviewed; >30 days without update means the decision is unverified against current upstream state)
- Description: Exception was raised <age_days> days ago. Either upstream now ships a fix (close the entry), the situation has changed (refresh Why not patched, Compensating control, and Revisit by with current rationale), or the risk acceptance is silently aging; none of which is acceptable to OCR.
- Remediation: Patch the dep if a fix exists, OR update the entry with current rationale and a new Revisit by. Bumping Revisit by alone without re-justifying does not clear this finding: the staleness check is on Raised: / git-add date, not on Revisit by. To reset the clock, add or update a Raised: field to today and document what was re-verified.
- Owner: copy the entry's Owner (default Kurt Niemi).
If age_days ≤ 30 for every open entry → §6a positive control row: "Exception log fresh — N open, oldest
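The parse-and-age step described in this section might look like the sketch below. The exact field layout of VULNERABILITY_EXCEPTIONS.md is not shown in this document, so the parser assumes plain "Field: value" lines (optionally written as "- **Field:** value") and ISO dates; the function names and the second advisory ID in the usage check are hypothetical.

```python
import re
from datetime import date

FIELD_RE = re.compile(r"^[\s\-*]*(Severity|Status|Raised|Revisit by)\s*:[\s*]*(.+?)\s*$")


def parse_open_entries(markdown):
    """Return {advisory_id: {field: value}} for entries under '## Open'."""
    entries, current, in_open = {}, None, False
    for line in markdown.splitlines():
        if line.startswith("## "):
            in_open = line.strip() == "## Open"
            current = None
        elif in_open and line.startswith("### "):
            current = line[4:].strip()
            entries[current] = {}
        elif in_open and current:
            match = FIELD_RE.match(line)
            if match:
                entries[current][match.group(1)] = match.group(2)
    return entries


def entry_ages(entries, today):
    """age_days = today - raised; None when Raised is missing (use the git fallback)."""
    ages = {}
    for advisory, fields in entries.items():
        raised = fields.get("Raised")
        ages[advisory] = (today - date.fromisoformat(raised)).days if raised else None
    return ages


def stale_ids(ages, max_age_days=30):
    """Advisories that trigger the staleness rule (strictly older than the limit)."""
    return sorted(a for a, age in ages.items() if age is not None and age > max_age_days)
```

Entries that come back with an age of None still need a date from one of the fallback sources before the staleness rule can be evaluated for them.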
Checklist (run in order — keep it updated as findings stabilize)
1. Recon: TLS (openssl s_client -tls1_1 should fail with alert 70), cert chain, security headers on frontend + backend, /api/config, /api/health, /docs and /openapi.json (404 in prod is good), /api/ext/auth/seed-admin (404 expected; referenced but not implemented), robots.txt. Run testssl.sh for full TLS config coverage.
   1b. Scripted sweep: run nuclei and ffuf against the backend URL (sanctioned invocations above). Review findings manually; fold HIGH/CRITICAL into the report. Skip nikto/sqlmap unless the sweep surfaces a lead.
   1c. Static analysis + dependency scan: semgrep (code), pip-audit (Python deps), trivy image against the deployed backend tag, osv-scanner --recursive /app/backend, gitleaks detect --source /app/backend --no-git. Feeds §8 and §9 of the report. Any gitleaks hit = automatic CRITICAL finding in §6.
   1d. Vulnerability exception log review: locate docs/pentest/VULNERABILITY_EXCEPTIONS.md (see "Vulnerability exception log" section above), parse every entry under ## Open, compute age_days per entry, cross-reference advisory IDs against the §1c dep-scan output, and apply the staleness rule (>30 days → HIGH §6 finding). Missing file → MEDIUM finding.
2. Cloud Run IAM: gcloud run services get-iam-policy <svc>. Both backend and frontend are allUsers → roles/run.invoker by design; auth is at app layer. Verify with unauth curl → expect 401 on /api/{users/me, patients, sessions, admin/users} and 403 "Service auth required" on /api/ext/auth/*.
3. CORS: preflight with Origin: https://evil.com must not be reflected. Legit origin = the frontend's Cloud Run URL; check allow_credentials: true is only paired with that.
4. Auth flow: decode JWT (alg=RS256, aud=<projectId>, firebase.sign_in_second_factor=totp). Tamper to alg=none → 401. Swap X-Tenant-ID header → ignored (tenant from JWT).
5. Unauth surface: /api/ext/auth/check-allowlist and check-status require Bearer. Recurring HIGH check: ext_auth.py:_verify_blocking_function_token must pass audience=<backend URL> to google.oauth2.id_token.verify_token. If the audience= kwarg is still missing, file it again.
6. Cross-tenant IDOR: do this autonomously, do NOT ask the user to create accounts. The runner has already provisioned two MFA-enrolled ephemeral users for you on this run: tokens in $PENTEST_TEST_ID_TOKEN_A / $PENTEST_TEST_ID_TOKEN_B (emails in $PENTEST_TEST_EMAIL_A / _B, uids in $PENTEST_TEST_UID_A / _B). Exercise both:
   - With $PENTEST_TEST_ID_TOKEN_A, POST one patient + one session through the authenticated endpoints (name prefix PENTEST-<run-uuid>) so there is a known-real A-owned record. Capture the returned UUIDs.
   - With $PENTEST_TEST_ID_TOKEN_B, GET each of: /api/patients/<A's-patient-uuid>, /api/sessions/<A's-session-uuid>, /api/soap-notes/<A's-soap-uuid> (if created), /api/appointments/<A's-appt-uuid> (if created). Expect 404 from the repo layer (NOT 403; 403 leaks existence). A 200 is a cross-tenant BOLA = CRITICAL + Breach candidate § 164.402. A 403 is HIGH (info leak via presence signal).
   - Repeat write-side probes: PUT / DELETE on the same foreign UUIDs with $PENTEST_TEST_ID_TOKEN_B → expect 404. A 200/204 means a write IDOR: CRITICAL + Breach candidate.
   - Probe X-Tenant-ID header swap on the same endpoints; the header must be ignored (tenant is taken from JWT).
   - Horizontal BOLA within a shared tenant (A and B land in the same tenant in single-tenant deployments): same probe set exercises clinician-scoped isolation. Record which scenario you're running (cross-tenant vs same-tenant horizontal) in §6.

   Fallback: if both tokens are missing (identity bootstrap failed), note explicitly in §12 that "2-account IDOR was not exercised this run — identity bootstrap returned no credentials". Do NOT fabricate results and do NOT attempt to mint new Firebase accounts inline; the runner owns that lifecycle.
7. Privilege escalation: /api/admin/* with clinician token → ADMIN_REQUIRED. JWT tamper (won't work, Firebase verifies sig).
8. Injection (targeted): 1–2 probes each on reflected XSS, SQLi (boolean+time on obvious params), SSRF on iCal feed_url (hostname allowlist + scheme check; don't bother with 169.254.169.254, it's blocked by hostname).
9. Rate limiting: 5 bad passwords on the real account via Firebase signInWithPassword (Firebase handles lockout; don't exceed 10). Skip if prior run already tripped lockout.
10. Upload DoS: Content-Length: 2000000000 + tiny body header-only probe (reject on header = good). Recurring MEDIUM check: sessions.py:upload_audio should read in bounded chunks, not await file.read() before size check.
11. MFA integrity: POST /api/users/me/mfa-enrolled as a fresh non-MFA sign-up. Recurring MEDIUM: returns a new mfa_enrolled_at without verifying Firebase-side enrollment. Doesn't grant access (JWT claim still gates), but poisons compliance metrics.
12. Signup hygiene: identitytoolkit /accounts:signUp with @example.invalid should fail in a locked-down deployment. If it succeeds, restrict_signups=false; flag it.
13. Cloud SQL direct: see section above. audit_logs table existence is the key HIPAA check.
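The status-code expectations in the cross-tenant IDOR item can be written down as a small decision function, which helps keep the report consistent across runs. This is a sketch: classify_idor_probe is a hypothetical helper, and any response the checklist does not describe (including a 403 on a write probe) is routed to manual review instead of being assigned a severity.

```python
def classify_idor_probe(method, status):
    """Map a foreign-UUID probe response to (verdict, breach_candidate).

    404 is the expected result; a 2xx on a foreign record is a BOLA
    (CRITICAL, Breach candidate); a 403 on GET leaks existence (HIGH).
    """
    if status == 404:
        return ("PASS", False)
    if 200 <= status < 300:
        return ("CRITICAL", True)
    if status == 403 and method.upper() == "GET":
        return ("HIGH", False)
    # 401, 5xx, redirects, or a 403 on PUT/DELETE: the checklist does not
    # define a severity, so flag for manual review.
    return ("REVIEW", False)


print(classify_idor_probe("GET", 404))     # ('PASS', False)
print(classify_idor_probe("DELETE", 204))  # ('CRITICAL', True)
```

Recording the tuple next to each probed URL gives §6 and §6a the same evidence trail whether the probe passed or failed.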
Deliverable — HIPAA-grade report format
Target audience: HHS OCR auditors, Pablo's owner/operator, and an external qualified pentester using this report as their scoping input. Every run produces the full artifact — there is no "lite" mode.
Never skip a finding. Every issue observed — INFO through CRITICAL — lands in §6 with a full write-up. Severity is how findings are ordered and highlighted, not a gate for inclusion. A "clean" run still enumerates the INFO-level observations it considered. §2 highlights CRITICAL/HIGH for the reader; nothing is dropped because it looked minor. If you considered an issue and decided it was a non-issue, that belongs in §6a (Positive controls and items tested clean) with the reasoning, not silently omitted.
Positive findings are first-class. §6a is mandatory — it captures the controls that passed and the things the assessor considered and dismissed, with the same evidence rigor as a finding (what was tested, what the expected behavior was, what was observed). This is what a Covered Entity shows an auditor to demonstrate that the technical safeguards are actually working, not just written down, and that the assessor looked at the full surface.
Put this disclaimer on every report cover: this is an automated self-assessment driven by an LLM, not an independent qualified third-party pentest. For full HIPAA §164.308(a)(8) defensibility (2024 NPRM anticipated to finalize in 2026), Pablo should still engage an independent qualified pentester at least annually. This weekly artifact complements that engagement and does not replace it; it is meant to surface issues between formal engagements and give the external tester a scoped starting point.
Report sections, in order (use these exact headings):
1. Cover & scope
Covered entity (Pablo Health, LLC); report ID (PABLO-PENTEST-<date>-<run UUID>); reporting period (since prior run in GCS); tester identity (CLI + model); authorization statement; in-scope systems (Cloud Run × 2, Cloud SQL, Firebase, GCS compliance bucket, Secret Manager, backend source); out-of-scope (GCP/Firebase/Vertex infra, third-party source repos, physical, social engineering, wireless); run metadata (project ID, URLs, connection name, region).
2. Executive summary
3–5 bullets, business-risk language. Lead with severity totals and trend vs prior run.
3. Asset inventory & data flow
From gcloud run services list, gcloud sql instances list, gcloud secrets list, gsutil ls, /api/config. Table: Asset | Type | ePHI touched | Region | Encryption at rest | Encryption in transit. Text data-flow.
Observed external egress destinations — enumerate from code + config, then reason about each destination's BAA coverage. List:
- Deployed service env vars that route inference: gcloud run services describe pablo-backend --region=us-central1 --format='value(spec.template.spec.containers[0].env)' → record CLAUDE_CODE_USE_VERTEX, GOOGLE_GENAI_USE_VERTEXAI, ANTHROPIC_VERTEX_PROJECT_ID, GOOGLE_CLOUD_LOCATION values.
- All external hostnames the backend can reach: grep -rEoh "https?://[a-zA-Z0-9.-]+" /app/backend | sort -u
For each destination the backend can send request bodies (prompts, transcripts, patient fields) to, determine whether it is covered by a Business Associate Agreement — Vertex AI under the project's Google Cloud BAA is covered; api.anthropic.com, api.openai.com, generativelanguage.googleapis.com (public Gemini direct) are not covered by the Google Cloud BAA. Apply the severity rubric below: ePHI traversing a non-BAA destination is a § 164.504(e) permitted-use violation and a § 164.402 Breach candidate — score per the rubric, do not pre-commit to a specific tier here.
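The hostname enumeration and BAA reasoning above could be sketched as a first-pass classifier. The not-covered set comes straight from the text; treating aiplatform.googleapis.com and region-prefixed *-aiplatform.googleapis.com hosts as the Vertex AI endpoints is an assumption of this sketch, and every other hostname is left for manual review instead of being guessed.

```python
import re

# Same pattern as the grep above, with a capture group for the hostname.
HOST_RE = re.compile(r"https?://([a-zA-Z0-9.-]+)")

# Destinations the text names as outside the Google Cloud BAA.
NOT_COVERED = {
    "api.anthropic.com",
    "api.openai.com",
    "generativelanguage.googleapis.com",
}


def is_vertex_host(host):
    """Assumed Vertex AI endpoint shapes: global or region-prefixed."""
    return host == "aiplatform.googleapis.com" or host.endswith("-aiplatform.googleapis.com")


def classify_egress(source_text):
    """Return {hostname: 'baa-covered' | 'not-covered' | 'review'}."""
    hosts = {h.lower().rstrip(".") for h in HOST_RE.findall(source_text)}
    result = {}
    for host in sorted(hosts):
        if is_vertex_host(host):
            result[host] = "baa-covered"
        elif host in NOT_COVERED:
            result[host] = "not-covered"
        else:
            result[host] = "review"
    return result
```

A "not-covered" hit only becomes a finding once it is confirmed that request bodies containing ePHI can actually reach that host; the classifier narrows the list, and the severity rubric decides the tier.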
4. Threat model (STRIDE-lite, Pablo-specific)
Actors: unauth internet; authenticated clinician cross-tenant; authenticated clinician horizontal BOLA within tenant; insider with GCP IAM; compromised dependency; compromised subprocessor. Crown jewels: patient records, session transcripts, audit_logs, Firebase/GCP credentials, AUTH_SECRET/JWT_SECRET_KEY. Map each actor to attack paths and to the findings in §6.
5. Methodology & frameworks
OWASP WSTG v4.2, OWASP API Security Top 10 (2023), OWASP ASVS 4.0 Level 2, PTES Technical Guidelines, NIST SP 800-115, HIPAA Security Rule §164.308/.310/.312/.314 (2024 NPRM). List tools from the scanner tooling section that were invoked with which flags.
6. Findings
Table: ID | Title | Severity | CVSS 3.1 vector | CWE | §164 control | Asset | Status. One subsection per finding with Description / Evidence (redacted) / Reproduction / Business impact / Remediation / Owner (default: CODEOWNERS or Kurt Niemi) / Target resolution (CRITICAL ≤7d, HIGH ≤30d, MEDIUM ≤90d, LOW ≤180d) / Status (new | carry-over-from:<prior report ID> | resolved-this-run | regression).
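The target-resolution windows above translate directly into a due date per finding. A minimal sketch, assuming the window starts on the date the finding was first reported (the text does not say which date anchors it); target_resolution is a hypothetical helper name.

```python
from datetime import date, timedelta

# Windows taken from the findings table definition above.
RESOLUTION_DAYS = {"CRITICAL": 7, "HIGH": 30, "MEDIUM": 90, "LOW": 180}


def target_resolution(severity, found_on):
    """Due date for a finding, or None for INFO (no window is defined for it)."""
    days = RESOLUTION_DAYS.get(severity.upper())
    return None if days is None else found_on + timedelta(days=days)


print(target_resolution("HIGH", date(2026, 1, 1)))  # 2026-01-31
```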
Severity rubric — HIPAA overlay applies. Raw CVSS underweights PHI impact: a "medium" CVSS CVE that enables PHI access is a § 164.402 Breach and ships as CRITICAL. Use:
effective_severity = max(cvss_tier, phi_impact_tier)
where phi_impact_tier is:
- CRITICAL — any unauthenticated PHI read; cross-tenant or horizontal PHI access by an authenticated user; PHI integrity compromise (write/delete across tenant boundary); PHI egress to infrastructure not covered by a Business Associate Agreement (§ 164.504(e) — the disclosure path itself is an impermissible use regardless of whether a Breach has occurred yet); any condition that would meet the § 164.402 "acquisition, access, use, or disclosure of PHI in a manner not permitted" definition of a reportable Breach.
- HIGH: authenticated privilege escalation; auth-bypass that could lead to PHI without a second bug; PHI availability loss > 24h; any secret in git (gitleaks hit); missing/non-firing audit logs on PHI routes (§ 164.312(b) gap).
- MEDIUM: info disclosure without direct PHI nexus; DoS vector; missing defense-in-depth on a PHI-adjacent path.
- LOW — defense-in-depth gap with no PHI nexus.
- INFO — observation / posture note.
"PHI-adjacent" means the path touches metadata, signals, or fields that would appear in a HIPAA compliance report, audit trail, or Breach risk-assessment, even when the underlying PHI is not itself exposed. Example: /api/users/me/mfa-enrolled poisons the mfa_enrolled_at field that feeds § 164.308(a)(5) workforce-MFA attestations — PHI-adjacent, MEDIUM. Counter-example: missing HSTS on a static-asset CDN with no session tokens — no PHI nexus, LOW.
Baseline examples (the raw CVSS column is what a pure-technical scorer would emit; effective is what ships):
| Finding | CVSS tier | PHI tier | Effective | Note |
|---|---|---|---|---|
| Unauth GET /api/patients/<id> returns 200 | High | Critical | Critical | § 164.402 Breach, reportable |
| Authed clinician B reads tenant A's patient | High | Critical | Critical | Cross-tenant BOLA = Breach |
| Stored XSS in session notes | Medium | Critical | Critical | PHI integrity + exfiltration path |
| audit_logs silent gap > 24h on PHI route | Medium | High | High | § 164.312(b) enforcement failure |
| Stale npm dep, CVSS 7.5, no PHI reachability | High | None | High | No overlay needed |
| HSTS header missing | Low | None | Low | Defense-in-depth only |
| Backend routes inference to api.anthropic.com (not Vertex); transcripts contain PHI | Low (config) | Critical | Critical | PHI disclosed to non-BAA subprocessor = § 164.504(e) impermissible use + § 164.402 Breach candidate; tag Breach candidate: § 164.402 |
Compute and emit a CVSS 3.1 base vector for every finding (e.g., AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:H/A:N) alongside the effective HIPAA-overlay severity, so auditors can see both scores and the reasoning.
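For reference, the base score behind a vector like the one above can be computed from the published CVSS 3.1 base equations. The sketch below covers base metrics only (no temporal or environmental metrics); base_score and cvss_tier are illustrative names, and the tier boundaries follow the standard CVSS qualitative rating scale.

```python
import math

WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}
# Privileges Required weights depend on whether Scope is changed.
PR_WEIGHTS = {
    "U": {"N": 0.85, "L": 0.62, "H": 0.27},
    "C": {"N": 0.85, "L": 0.68, "H": 0.5},
}


def roundup(value):
    """CVSS 3.1 Roundup: smallest one-decimal number >= value, float-safe."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector):
    """Base score for a vector such as 'AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:H/A:N'."""
    m = dict(p.split(":") for p in vector.split("/") if not p.startswith("CVSS"))
    scope = m["S"]
    iss = 1 - ((1 - WEIGHTS["C"][m["C"]]) * (1 - WEIGHTS["I"][m["I"]])
               * (1 - WEIGHTS["A"][m["A"]]))
    if scope == "U":
        impact = 6.42 * iss
    else:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    exploitability = (8.22 * WEIGHTS["AV"][m["AV"]] * WEIGHTS["AC"][m["AC"]]
                      * PR_WEIGHTS[scope][m["PR"]] * WEIGHTS["UI"][m["UI"]])
    if impact <= 0:
        return 0.0
    if scope == "U":
        return roundup(min(impact + exploitability, 10))
    return roundup(min(1.08 * (impact + exploitability), 10))


def cvss_tier(score):
    """Qualitative rating for a base score."""
    if score == 0:
        return "None"
    if score < 4.0:
        return "Low"
    if score < 7.0:
        return "Medium"
    if score < 9.0:
        return "High"
    return "Critical"


print(base_score("AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:H/A:N"))  # 10.0
```

Emitting both the vector and the computed score lets a reader verify the CVSS column independently of the HIPAA overlay.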
Breach-candidate flag. Any finding scored CRITICAL because of a PHI nexus (not because CVSS alone was Critical) must carry a Breach candidate: § 164.402 tag in its subsection header, with a sentence naming the specific disclosure path. This is what separates a HIPAA-aware report from a generic CVE dump, and it's the line an OCR reviewer looks for first.
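The max(cvss_tier, phi_impact_tier) rule and the Breach-candidate flag can be expressed over an ordered tier list. A minimal sketch: the function names are illustrative, ranking NONE below INFO is an assumption, and the flag is read here as "the PHI tier is CRITICAL", which covers every finding that is CRITICAL through its PHI nexus and excludes those that are CRITICAL on CVSS alone.

```python
# Ordered from least to most severe; NONE below INFO is an assumption.
TIERS = ["NONE", "INFO", "LOW", "MEDIUM", "HIGH", "CRITICAL"]


def effective_severity(cvss_tier, phi_impact_tier):
    """effective_severity = max(cvss_tier, phi_impact_tier) on the ordered tiers."""
    return max(cvss_tier.upper(), phi_impact_tier.upper(), key=TIERS.index)


def breach_candidate(phi_impact_tier):
    """True when the finding needs the 'Breach candidate: § 164.402' tag."""
    return phi_impact_tier.upper() == "CRITICAL"


# Rows from the baseline table: (CVSS tier, PHI tier)
print(effective_severity("Medium", "Critical"))  # CRITICAL (stored XSS in session notes)
print(effective_severity("High", "None"))        # HIGH (stale dep, no PHI reachability)
```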
Methodology reference: OWASP Risk Rating Methodology (business-impact multiplier), HITRUST CSF Risk Analysis Guide, HHS OCR "Guidance on Risk Analysis Requirements" under § 164.308(a)(1)(ii)(A). The max(cvss_tier, phi_impact_tier) formula is the common Coalfire/Clearwater pattern for HIPAA-aware pentest reports.
6a. Positive controls and items tested clean
Single evidence-rigor table — no parallel bullet list elsewhere. Columns: Control | What was tested | Expected behavior | Observed evidence | §164 control mapped. Draw from the checklist (§1–§13) — at minimum cover: TLS ≥1.2 enforcement, security headers (HSTS / CSP / X-Content-Type-Options), CORS origin allowlist, unauth 401 on PHI routes, JWT alg=none rejection, X-Tenant-ID header ignored, cross-tenant 404-not-403, admin-route ADMIN_REQUIRED enforcement, Cloud Run egress env vars pointing to Vertex (not public Gemini/Anthropic), audit_logs table present with recent rows, audit schema PHI-free, CI supply-chain scans passing on the deployed tag. Every row cites specific evidence (status code, header value, row count, screenshot-equivalent text output).
Also include controls the model considered and dismissed as non-issues — e.g. "/api/ext/auth/seed-admin returns 404 as expected, not a live backdoor despite setup-solo.sh reference." This is what shows the auditor the technical safeguards are working and the assessor looked at the full surface, not merely what was written down. Dismissed scanner false positives belong in §8 / §9 with the tool context, not here.
7. HIPAA Security Rule control matrix
Copy the table under HIPAA control matrix below into this section, filling Status / Evidence / Gap from this run's observations. Every row required; N/A rows need justification. Every Partial or Fail must link to a finding in §6.
8. Static analysis findings (semgrep)
Only confirmed-real hits, grouped by rule ID with file:line. Dismissed false positives stay here as a Dismissed: subsection with the reason — not duplicated in §6a or Appendix B.
9. Dependency & supply-chain scan
CI already runs pip-audit, trivy, and npm audit on every PR + weekly (see .github/workflows/ci.yml + security.yml). The pentest adds what CI can't: (a) scanning what's currently deployed (may lag main), (b) self-contained evidence inline in the report (auditors shouldn't need to cross-reference the GitHub Security tab). Flag any divergence between "CI clean" and "deployed tag vulnerable" as its own finding.
Subsections:
- Python deps (pip-audit): Package | Installed | Vulnerable | Fixed | CVE | CVSS
- Deployed container image: trivy image <deployed tag>, where the tag comes from gcloud run services describe pablo-backend --region=us-central1 --format='value(spec.template.spec.containers[0].image)'
- Multi-ecosystem: osv-scanner --recursive /app/backend
- Secrets in git: gitleaks detect --source /app/backend --no-git (any hit = CRITICAL finding in §6)
- Documented exceptions: table of every entry under ## Open in docs/pentest/VULNERABILITY_EXCEPTIONS.md. Columns: Advisory | Package | Severity | Status | Raised | Age (days) | Revisit by | Stale (>30d)?. Note the file path actually used (or "file not found" with the paths searched). For each row whose advisory ID also appears in pip-audit / trivy / osv-scanner output above, add a Cross-ref: line under the row pointing at that scanner's finding ID, and downgrade the §6 entry for that advisory to INFO with Status: documented-exception. Stale rows (Age > 30) get a separate HIGH §6 finding per the staleness rule.
10. Prior-run carry-over & trends
Fetch the newest .md before today from gs://<COMPLIANCE_REPORT_BUCKET>/pentest/. Diff into: Resolved since last run / Persisting (with "consecutive runs open" counter) / Regressions (elevate to HIGH minimum) / New this run. If no prior report exists, mark this as the baseline.
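The four-way diff described above reduces to set operations on finding IDs. The sketch below assumes the prior report can be reduced to a set of open IDs, a set of resolved IDs, and a per-finding "consecutive runs open" count; how those are extracted from the prior .md is left out, and the function names are hypothetical.

```python
def diff_runs(prior_open, prior_resolved, current_open):
    """Bucket finding IDs for the carry-over section (all inputs are sets)."""
    return {
        "resolved": sorted(prior_open - current_open),
        "persisting": sorted(prior_open & current_open),
        # previously resolved and open again: elevate to HIGH minimum
        "regressions": sorted(prior_resolved & current_open),
        "new": sorted(current_open - prior_open - prior_resolved),
    }


def consecutive_open(prior_counts, current_open):
    """Update the 'consecutive runs open' counter for every open finding."""
    return {fid: prior_counts.get(fid, 0) + 1 for fid in current_open}
```

With no prior report, passing empty sets puts every finding in the "new" bucket, which matches the baseline case.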
11. Endpoint coverage matrix
Group by auth profile, not by individual route. Columns: Auth profile | Tenant-scoped | Route count | Representative routes | Tests run | Result. One row per (auth_requirement, tenant_scope) combination — e.g. one row covering all require_mfa + tenant-scoped PHI routes, one for require_mfa + admin-only, one for get_current_user_no_mfa, one for public. List 2–3 representative routes per row; the exhaustive enumeration goes in Appendix B under endpoints.txt.
Any route that doesn't fit a known profile (missing Depends(), explicit public=true tag, unusual composition) gets its own row. Every unusual row needs either a positive test result or an explicit "deferred to external pentester" reason — never "ran out of time."
Enumerate the full set for the appendix:
grep -rnE "@router\.(get|post|put|delete|patch)\(" /app/backend/app/routes/ > /tmp/endpoints.txt
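Once each enumerated route has been labeled with its auth profile and tenant scope (a manual or semi-manual step this sketch does not attempt), the matrix rows can be built by grouping. The tuple layout and the `coverage_rows` name below are assumptions for illustration:

```python
from collections import defaultdict


def coverage_rows(routes: list[tuple[str, str, bool]], max_examples: int = 3) -> list[dict]:
    """Group (path, auth_profile, tenant_scoped) records into one row per profile."""
    groups: dict[tuple[str, bool], list[str]] = defaultdict(list)
    for path, auth_profile, tenant_scoped in routes:
        groups[(auth_profile, tenant_scoped)].append(path)
    return [
        {
            "auth_profile": auth_profile,
            "tenant_scoped": tenant_scoped,
            "route_count": len(paths),
            # Keep only a few examples; the full list belongs in Appendix B.
            "representative_routes": sorted(paths)[:max_examples],
        }
        for (auth_profile, tenant_scoped), paths in sorted(groups.items())
    ]
```

The `Tests run` and `Result` columns still have to come from the actual probes; grouping only fixes the row structure and the route counts.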
12. Automated assessment scope boundaries
Describe what is outside the boundary of this automated engagement and why, framed as scope constraints rather than gaps. The audience is auditors establishing what human-led testing should cover next; frame each entry as "requires human judgment / multi-session context / physical access / social engineering" rather than "we didn't test this." Entries belong here when they fall outside what any automated tool can assert — business-logic depth, multi-step stateful workflows, novel zero-days, prompt-injection depth, physical controls, social engineering. Do not list items that were simply skipped due to time; those are findings or fallback notes in the relevant section.
Static analysis exclusions (semgrep): The following paths are excluded from semgrep via --exclude flags (rationale mirrored in /.semgrepignore). A human reviewer or independent pentester should audit these paths directly:
| Excluded path | Rule(s) suppressed | Rationale |
|---|---|---|
| `alembic/` | `avoid-sqlalchemy-text` | Migration DDL; `text()` calls are operator-authored SQL, never reachable from web requests |
| `tests_integration/`, `tests/` | `avoid-sqlalchemy-text` | Test fixture DDL; not on the production attack surface |
| `app/db/__init__.py`, `app/db/provisioning.py` | `avoid-sqlalchemy-text` | Tenant schema management DDL (`CREATE SCHEMA`, `SET search_path`, `pg_advisory_lock`, RLS policy setup); all `text()` arguments are system-generated, never from user input |
| `app/jobs/pentest_*.py` | `dynamic-urllib-use-detected` | Self-assessment tooling; dynamic outbound calls target IAM-gated GCP admin APIs and operator-configured webhook URLs |
| `app/jobs/hipaa_log_review.py` | `dynamic-urllib-use-detected` | Audit log reviewer; `urllib` targets an operator-configured webhook (admin-only runtime config) |
Inline # nosemgrep annotations suppress individual false-positive hits where exclusion would be too broad (e.g. auth/service.py unverified-JWT routing helper, logger calls in auth handlers).
13. Prioritized remediation roadmap
Ordered list grouped by severity. Columns: Finding ID | Severity | Effort (S/M/L) | Target date | Owner | Retest by. The Retest by column defaults to "next scheduled run" — only override for CRITICAL items that need sooner re-verification (e.g. "next run + manual curl confirmation within 7 days"). The §15 retest-plan section was merged here; if a finding needs a bespoke retest procedure beyond "run this skill again," describe it inline in that finding's §6 subsection under a Retest procedure: line.
14. Appendices
The pentest_runner.py wrapper uploads every raw scanner artifact to gs://<COMPLIANCE_REPORT_BUCKET>/pentest/<run-uuid>/raw/ with retention lock. The report inlines findings-level output (the specific lines that drove a conclusion) and links to the GCS object for full raw dumps. Inline blocks are capped at ~50 lines each — anything longer, link out.
- A: Commands executed. Chronological list of every shell command/API call the run made (redact tokens and any PHI). One command per line. Audit defense: an OCR reviewer must be able to reconstruct what actually ran. This one stays fully inline: it's short and load-bearing.
- B: Scanner invocations & findings. One subsection per tool (`nuclei`, `ffuf`, `semgrep`, `pip-audit`, `trivy`, `osv-scanner`, `gitleaks`, `testssl.sh`, `nikto`/`sqlmap` if invoked, plus the `endpoints.txt` enumeration from §11). For each: exact invocation, exit code, summary line counts (total / high / medium / low / info), inline excerpts of the specific lines that became §6 findings, and a `Raw output: gs://.../raw/<tool>.txt` link. Dismissed false positives live here with their dismissal reason; do not duplicate them in §6a.
- C: Cloud SQL query log. Every `SELECT` run against the DB and the row counts returned (values redacted). Stays fully inline.
- D: SBOM. Link to `gs://.../raw/sbom.cyclonedx.json`; inline a one-line summary (component count, critical CVE count). If the image scan is unavailable, note that and link to `pip-list.txt` instead.
- E: Attestation block. Tester identity (CLI + model), run UUID, ISO timestamp, SHA256 of the report body (excluding this block). Unsigned (automated run); needs human countersign before the operator uses it as input to the annual §164.314(a) written verification to Covered Entities.
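For the attestation hash, the key detail is that the digest covers only the report body, so the block can be appended without invalidating its own hash. A minimal sketch using the standard library; the field labels and function names are assumptions, since the exact attestation layout is not specified here:

```python
import hashlib


def report_body_sha256(report_body: str) -> str:
    """Hash the report body only; the attestation block is never part of the input."""
    return hashlib.sha256(report_body.encode("utf-8")).hexdigest()


def attestation_block(report_body: str, tester: str, run_uuid: str, timestamp_iso: str) -> str:
    """Render the block that is appended after the hashed body (labels are illustrative)."""
    lines = [
        f"Tester: {tester}",
        f"Run UUID: {run_uuid}",
        f"Timestamp: {timestamp_iso}",
        f"SHA256 (report body): {report_body_sha256(report_body)}",
        "Signature: unsigned (automated run); human countersign required",
    ]
    return "\n".join(lines)
```

A reviewer can recompute the digest by cutting the report at the start of appendix E and hashing everything before it, provided the same encoding and line endings are used.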
Output handling: emit the complete markdown report — through all appendices — to stdout. Do NOT gsutil cp or otherwise upload to GCS yourself; the calling runner (pentest_runner.py) captures stdout and uploads to the retention-locked compliance bucket. Uploading from inside the skill creates duplicate objects with inconsistent metadata. The final thing you emit must be the closing of appendix E; do not append a trailing "uploaded to gs://…" line.
Length & tone: body 2000–3000 words; appendices on top of that. Inline evidence for anything ≤50 lines; link to the GCS raw-artifact bucket for longer dumps (see §14). Audit-ready neutral tone. Every "pass" claim cites observable evidence.
HIPAA control matrix (copy into §7 of every report)
Administrative safeguards (§164.308)
| Control | Requirement | Status | Evidence | Gap |
|---|---|---|---|---|
| §164.308(a)(1)(ii)(A) | Risk analysis — accurate, thorough, documented; reviewed ≥12mo | |||
| §164.308(a)(1)(ii)(B) | Risk management — reduce risks to reasonable level | |||
| §164.308(a)(1)(ii)(D) | Information system activity review — logs, access reports, incident reports | |||
| §164.308(a)(3)(i) | Workforce authorization / supervision | |||
| §164.308(a)(3)(ii)(C) | Termination procedures — revoke access on departure | |||
| §164.308(a)(4)(ii)(B) | Access authorization — granted per role | |||
| §164.308(a)(4)(ii)(C) | Access establishment & modification | |||
| §164.308(a)(5)(ii)(C) | Log-in monitoring — detect anomalies | |||
| §164.308(a)(5)(ii)(D) | Password management (NPRM: MFA required) | |||
| §164.308(a)(6)(ii) | Security incident response & reporting | |||
| §164.308(a)(7)(ii)(A) | Data backup plan — tested | |||
| §164.308(a)(7)(ii)(B) | Disaster recovery plan — restore ≤72h (NPRM) | |||
| §164.308(a)(7)(ii)(D) | Contingency plan testing ≥12mo | |||
| §164.308(a)(8) | Technical evaluation (this report) — annual pentest + biannual vuln scan (NPRM) |
Physical safeguards (§164.310) — out of scope for this automated scan (cloud-inherited or workforce-level). Operator tracks separately.
Technical safeguards (§164.312) — the heart of this pentest.
| Control | Requirement | Status | Evidence | Gap |
|---|---|---|---|---|
| §164.312(a)(1) | Unique user identification | |||
| §164.312(a)(2)(ii) | Emergency access procedure | |||
| §164.312(a)(2)(iii) | Automatic logoff / session timeout | |||
| §164.312(a)(2)(iv) | Encryption / decryption of ePHI at rest (NPRM: required) | |||
| §164.312(b) | Audit controls — record & examine activity | |||
| §164.312(c)(1) | Integrity — protect ePHI from improper alteration/destruction | |||
| §164.312(c)(2) | Mechanism to authenticate ePHI (NPRM) | |||
| §164.312(d) | Person or entity authentication (NPRM: MFA required) | |||
| §164.312(e)(1) | Transmission security | |||
| §164.312(e)(2)(i) | Integrity controls in transit | |||
| §164.312(e)(2)(ii) | Encryption in transit (NPRM: required) |
Organizational / BA contracts (§164.314) — paper controls, out of scope for this automated scan. Operator tracks separately.
Administrative (§164.308) non-technical rows — rows like workforce authorization, termination procedures, risk analysis documentation are paper controls. The scanner only fills rows where it has direct technical evidence (e.g. §164.308(a)(5)(ii)(D) MFA via JWT claim inspection, §164.308(a)(1)(ii)(D) via audit_logs freshness). Mark the rest N/S (out of automated scope) — do not mark them Pass/Fail from assumption.
Status values: Pass / Partial / Fail / N/S (out of automated scope). Every Partial / Fail must link to a finding in §6. N/S rows do not require evidence — they're tracked outside this report.
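These status rules are mechanical enough to check before emitting the report. A sketch of such a check, assuming the matrix has been parsed into dicts with `control`, `status`, `evidence`, and `finding_id` keys (hypothetical names), and using `N/S` as the short form of the out-of-scope status:

```python
ALLOWED_STATUSES = {"Pass", "Partial", "Fail", "N/S"}


def matrix_problems(rows: list[dict]) -> list[str]:
    """Return one message per control-matrix row that breaks the status rules."""
    problems = []
    for row in rows:
        control, status = row["control"], row["status"]
        if status not in ALLOWED_STATUSES:
            problems.append(f"{control}: unknown status {status!r}")
        elif status in {"Partial", "Fail"} and not row.get("finding_id"):
            problems.append(f"{control}: {status} needs a linked §6 finding")
        elif status == "Pass" and not row.get("evidence"):
            problems.append(f"{control}: Pass needs observable evidence")
        # N/S rows need neither evidence nor a finding.
    return problems
```

An empty list means every row satisfies the rules above; it says nothing about whether the cited evidence is actually sufficient.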
Known recurring findings (verify each run — patches may land between runs)
Migration target: any row whose re-test is a one-line grep belongs in make lint (as a custom semgrep rule or a tiny pytest) rather than in a weekly pentest. When a regression check drops into the pentest twice in a row and has a mechanical verifier, file a task to promote it to CI and remove the row here. The pentest should focus on things a static check can't catch (live config, cross-tenant behavior, deployed-image vs source drift).
| Findings seen in past runs | Re-test |
|---|---|
| `ext_auth.py` `verify_token` missing `audience=` | grep `ext_auth.py` for `verify_token(`; confirm `audience=` kwarg present |
| `AuditService()` instantiated without DB → `logger.info` only | grep `get_audit_service` and `_persist`; confirm a Postgres write path |
| `sessions.py:upload_audio` reads before size check | grep `upload_audio` for `await .*\.read()`; confirm chunked reads |
| `/api/users/me/mfa-enrolled` trusts client | grep `mfa-enrolled` for Firebase Admin MFA verification |
| `restrict_signups=false` on this deployment | `gcloud run services describe pablo-backend --region=us-central1` env vars + live signUp probe |
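As an example of promoting a grep-style row to CI, the `verify_token` check could become a tiny pytest built on Python's `ast` module, which is less brittle than a text grep when the call spans several lines. This is a sketch only; the helper name is hypothetical and the source would be read from the real `ext_auth.py` at whatever path the repo uses:

```python
import ast


def calls_missing_kwarg(source: str, func_name: str, kwarg: str) -> list[int]:
    """Return line numbers of calls to func_name that do not pass kwarg by name."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        callee = node.func
        # Handle both verify_token(...) and module.verify_token(...).
        if isinstance(callee, ast.Attribute):
            name = callee.attr
        else:
            name = getattr(callee, "id", None)
        if name == func_name and kwarg not in {kw.arg for kw in node.keywords}:
            missing.append(node.lineno)
    return sorted(missing)
```

A CI test would assert that `calls_missing_kwarg(source, "verify_token", "audience")` returns an empty list. Calls that forward `**kwargs` are reported as missing, which errs on the side of a manual look.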
Cleanup checklist (do before writing the report)
- Delete any test users: `curl -X POST "identitytoolkit /accounts:delete" -d '{"idToken":"<tok>"}'` (sign in to get a fresh idToken first).
- `pkill -f "cloud-sql-proxy --port 15433"` (or whatever port you used).
- Confirm no UPDATE/DELETE/INSERT was issued (review your psql commands).
- Redact any PHI names/emails/DOBs from evidence strings in the report.
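A first-pass redaction of evidence strings can be scripted, though it does not replace a manual read: regexes catch emails and date-shaped values but cannot reliably detect names. A minimal sketch with assumed placeholder tokens and deliberately simple patterns:

```python
import re

# Simplified patterns; they favor over-redaction and will not catch every format.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
DATE = re.compile(r"\b(?:\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{4})\b")


def redact(text: str) -> str:
    """Mask emails and date-like values in one evidence string."""
    text = EMAIL.sub("[REDACTED-EMAIL]", text)
    return DATE.sub("[REDACTED-DATE]", text)
```

Because any standalone `YYYY-MM-DD` or `M/D/YYYY` value is masked, some non-PHI dates in evidence will be redacted too; names still need to be removed by hand.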