update-form-483s

name: update-form-483s description: Idempotent end-to-end workflow to sync the mycq.ai Form 483s database from FDA FOIA Reading Room + Data Dashboard sources, extract observations from PDFs, backfill location data, and validate completeness via the 5-scan blindspot gate. Use when asked to "update form 483s", "sync 483s data", "refresh 483 observations", or as a recurring weekly/monthly maintenance job.

update-form-483s — End-to-End Database Maintenance Skill

Why this skill exists

Sessions 28L → 28-fix → 29 produced 3 iteration waves before the form_483s pipeline displayed correctly in production. Each wave only fixed the visible symptoms of the prior wave's reported bug. This skill codifies every lesson so the next refresh runs in one pass without manual debugging.

Use this skill whenever:

New FDA FOIA xlsx is published (FDA updates monthly)
User reports any Form 483s display issue ("0 observations", "Unknown country", "Invalid id")
Recurring weekly/monthly database freshness audit
Any time a script writes to form_483s, form_483_observations, fda_inspections, or warning_letters

Prerequisites

Working directory: D:\AI\创业项目\FDA\mycq.ai\regulatory-dashboard
Env vars in .env.local: NEXT_PUBLIC_SUPABASE_URL, SUPABASE_SERVICE_ROLE_KEY, OPENROUTER_API_KEY (free), optionally ANTHROPIC_API_KEY (Pass 2 fallback)
Supabase project_id: zzilbkehuxvbajupambt
Latest FDA FOIA xlsx in FDA forms/最新483s.xlsx (download from https://www.fda.gov/...).

The 7-step pipeline (run in order)

Step 1 — FOIA xlsx → form_483s sync (catches new records)

node scripts/sync-fda-foia-483s.cjs --length 500 --no-refresh  # MV refresh deferred to Step 5

What it does: Pulls the FOIA Reading Room HTML (DataTables-style ?length=500), parses MM/DD/YYYY dates, INSERTs missing records with record_id=fda-foia-{FEI}-{YYYYMMDD}, UPDATEs gaps in country/state/publish_date/download_url on existing rows. Uses (fei_number, record_date) as the dedup key.

Gotchas:

Drupal Views ?page=N silently no-ops; only ?length=N works.
FDA DataDashboard returns ISO alpha-2 codes (US/DE/JP), not alpha-3. The COUNTRY_CODE_MAP must include alpha-2 keys.

Expected outcome: 0-20 new INSERTs per monthly run; ~5-15 country/state UPDATEs on previously-incomplete rows.

Step 2 — country backfill for legacy fda-* rows

node scripts/fix-fda-483s-country.cjs --no-refresh  # MV refresh deferred to Step 5

What it does: SELECTs all record_id LIKE 'fda-%' AND country IS NULL rows. For each unique FEI, POSTs to https://api-datadashboard.fda.gov/v1/inspections_classifications. Resolves country via COUNTRY_CODE_MAP and UPDATEs form_483s.

Expected outcome: 0-20 updates per run (most legacy data is already filled).

Step 3 — Run the gap-fill observation extraction (covers 100% of `form_483s` with `download_url`)

This is the architectural lesson from Session 29. The script iterates form_483s directly with LEFT JOIN form_483_observations IS NULL — NOT the form_483_foia_audit table (that was Session 28L's mistake which missed 229 PDFs).

# Pass 1 + Pass 2 (Claude Haiku vision fallback) — primary path
node scripts/extract-observations-gap-fill.cjs

# If Anthropic credit exhausted, finish with free OpenRouter Pass 3 (model chain):
node scripts/extract-observations-gap-fill-openrouter.cjs                                  # default: gemma-4-31b
node scripts/extract-observations-gap-fill-openrouter.cjs --model "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free"
node scripts/extract-observations-gap-fill-openrouter.cjs --model "google/gemma-4-26b-a4b-it:free"

Pipeline inside the script:

Pass 1 (free) — pdfjs-dist text extraction + regex /(?:^|[\n\s.;:!?])(?:OBSERVATION|Observation|OBSERVATIONS|OBS\.?)\s+(\d+)\b/g. Typically covers ~76% of PDFs at $0.
Pass 2 (~$0.014/PDF) — claude-haiku-4-5 vision; triggered only when Pass 1 returns 0 AND totalChars < 200 (true scan). $5 hard abort.
Pass 3 chain (all free) — for cases Pass 2 couldn't bill (Anthropic credit exhausted), iterate through the model fallback chain via --model flag. Empirically validated chain (Session 29 yielded 15/15 recoverable = 100%):
- google/gemma-4-31b-it:free (default, largest 31B vision) — gets 5/12 (~42%)
- nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free — picks up 5 more of Gemma's HTTP 429 rate-limit failures (~71% cumulative)
- google/gemma-4-26b-a4b-it:free — closes the long tail (Nemotron 30B null-output cases) → 100% cumulative
- nvidia/nemotron-nano-12b-v2-vl:free — Session 28L's original Pass 2 model; mid-tier quality (~14% solo). Use only if all 3 above are rate-limited.
- Why a chain works: each model has separate upstream rate limits + different output formats. HTTP 429 from one model = healthy budget at another.

Resumability: scripts/.gap_fill_checkpoint.json tracks processed_form_483_ids + failed_form_483_ids. Re-running with --resume skips processed rows.

Step 4 — Apply observation_count + location backfill migrations

npx supabase db push

Migrations applied:

20260516000000_backfill_form_483s_location_from_inspections.sql — COALESCE city/state/country from fda_inspections JOIN (8,665 cities backfilled in initial run)
20260516000001_resync_form_483s_observation_count.sql and 20260517000000_resync_observation_count_post_gapfill.sql — re-sync scalar observation_count to match COUNT(form_483_observations) for rows where they diverge

If npx supabase db push hangs on interactive prompts, use MCP fallback:

mcp__claude_ai_Supabase__apply_migration with the SQL from each .sql file

Step 5 — Refresh materialized views (mandatory after any form_483s write)

Per Session 23 lesson — company_stats and company_facility_links are MATERIALIZED VIEWs. Reading them is fast (~138ms) but recomputing takes ~18s, so writes don't trigger automatic refresh.

node scripts/refresh-company-views.cjs

If local PG env not set (Session 28k discovery):

mcp__claude_ai_Supabase__execute_sql: SELECT refresh_company_views();

Step 6 — Run the 5-scan blindspot gate (FORCING FUNCTION)

This is the completion checkpoint. Do NOT report "update complete" until all 5 scans show ✅. See full SQL in .claude/docs/data-pipeline-checklist.md.

Quick version — all 5 scans in one go:

SELECT
  -- A. has download_url + extraction NEVER ran (Session 29 root cause)
  (SELECT COUNT(*) FROM form_483s f
    LEFT JOIN (SELECT DISTINCT form_483_id FROM form_483_observations) o ON o.form_483_id=f.id
    WHERE f.download_url IS NOT NULL AND o.form_483_id IS NULL AND f.record_date IS NOT NULL) AS A_never_extracted,

  -- B1. scalar > 0 + 0 detail rows + has download_url (extractable gap — should approach 0)
  (SELECT COUNT(*) FROM form_483s f
    LEFT JOIN (SELECT form_483_id FROM form_483_observations GROUP BY form_483_id) o ON o.form_483_id=f.id
    WHERE f.observation_count > 0 AND o.form_483_id IS NULL AND f.download_url IS NOT NULL AND f.record_date IS NOT NULL) AS B1_extractable_gap,

  -- B2. scalar > 0 + 0 detail rows + NO download_url (structural; can't fix without acquiring PDFs)
  (SELECT COUNT(*) FROM form_483s f
    LEFT JOIN (SELECT form_483_id FROM form_483_observations GROUP BY form_483_id) o ON o.form_483_id=f.id
    WHERE f.observation_count > 0 AND o.form_483_id IS NULL AND f.download_url IS NULL AND f.record_date IS NOT NULL) AS B2_structural_gap,

  -- C. scalar disagrees with extracted COUNT
  (SELECT COUNT(*) FROM form_483s f
    JOIN (SELECT form_483_id, COUNT(*) AS c FROM form_483_observations GROUP BY form_483_id) o ON o.form_483_id=f.id
    WHERE f.observation_count IS DISTINCT FROM o.c) AS C_scalar_drift,

  -- D. FEI exists in fda_inspections but form_483s.city IS NULL
  (SELECT COUNT(*) FROM form_483s f
    JOIN (SELECT DISTINCT ON (fei_number) fei_number, city FROM fda_inspections WHERE fei_number IS NOT NULL ORDER BY fei_number, inspection_end_date DESC NULLS LAST) i ON i.fei_number=f.fei_number
    WHERE f.city IS NULL AND i.city IS NOT NULL) AS D_location_gap,

  -- E. record_id prefix family inconsistency
  (SELECT COUNT(*) FROM form_483s WHERE record_id NOT LIKE 'ft-%' AND record_id NOT LIKE 'fda-foia-%' AND record_id NOT LIKE 'FDA-%' AND record_id NOT LIKE 'fda-%') AS E_unknown_prefix;

Gate criteria:

A ≤ 30 (only true Pass 1+2+3 failures remain — scanned PDFs with no recoverable text)
B1 ≤ 50 (extractable gap; re-run Step 3 with --resume if higher)
B2 is informational only — represents form_483s rows where fdatracker scraped a count but we have no PDF. Not "fixable" via extraction; flag in UI as "full text unavailable" or backfill download_url from new source.
C = 0
D ≤ 270 (only FEIs absent from fda_inspections — unreachable via JOIN)
E = 0 (new prefix detected → frontend [id]/route.ts regex MUST be updated to match)

If any criterion fails, iterate until it passes before reporting completion. Don't repeat the Session 28L→29 wave-by-wave fix pattern.

Step 7 — Production verification

After successful gate pass:

npx tsc --noEmit && npx next build — confirm 0 TS errors, build clean
git add scripts/ supabase/migrations/ && git commit -m "..." && git push origin main — Vercel auto-deploys
Smoke-test in browser at mycq.ai/fda-tracker/form483s:
- List page shows real cities (not —), real observation counts (not 0 for FEI-bearing rows)
- Click into Wells Pharmacy / SCA Pharmaceuticals / Fujifilm Diosynth Texas → detail page renders with Observations tab populated
- URL mycq.ai/fda-tracker/form483s/fda-foia-3013736415-20260313 returns 200 (not "Invalid id")

Idempotency guarantees

Step 1 (sync) — (fei_number, record_date) dedup key prevents duplicate INSERTs. Safe to re-run hourly.
Step 3 (extraction) — checkpoint at scripts/.gap_fill_checkpoint.json prevents reprocessing. UPSERTs form_483_observations with onConflict: form_483_id,observation_number.
Step 4 (migrations) — both use IS DISTINCT FROM / COALESCE so re-applying is a no-op.
Step 5 (MV refresh) — CONCURRENTLY refresh; reads stay served during compute.
Step 6 (gate) — read-only.

Anti-patterns (don't repeat these)

Mistake	Lesson	Where it bit us
Iterate from a JOIN/audit table instead of the source-of-truth table	Always iterate `form_483s` for "all PDFs"; never trust a derived view as the universe	Session 28L missed 229 PDFs
INNER JOIN verification SQL	Use `LEFT JOIN ... IS NULL` to find "never processed", not `JOIN` to validate "processed quality"	Session 28-fix's verification masked the 229-row gap
Skip MV refresh after write	`company_stats` is MATERIALIZED → reads return stale until `refresh_company_views()` runs	Session 23: HTTP 500 from 25.5s recompute
Pay for vision before trying free text extraction	pdfjs+regex handles ~76% of FDA Form 483 PDFs at $0	Session 28L spent $17.65 when ~$4 would have sufficed
Add new `record_id` prefix without updating `[id]/route.ts` regex	The frontend URL parser rejects unknown prefixes as `Invalid id`	Session 28-fix: 1,964 detail pages broken until route fixed
Use `npm canvas` and `@napi-rs/canvas` interchangeably	`@napi-rs/canvas` hangs on `page.render()` with pdfjs v3; must use `canvas` (npm)	Session 28L lost ~1 hour to this
`pdfjs-dist v5+` ESM-only	Pin `pdfjs-dist@^3.11` for CJS `require()` compat	Session 28L lost ~30 min
Pass Node `Buffer` directly to pdfjs	pdfjs checks constructor name; must wrap: `new Uint8Array(buf.buffer, buf.byteOffset, buf.byteLength)`	Session 28L lost ~30 min

Cost expectations per run

Step	Cost	Notes
Step 1 sync	$0	HTTP scrape, no LLM
Step 2 country	$0	FDA API, no key needed
Step 3 Pass 1	$0	pdfjs+regex local
Step 3 Pass 2	$0-3	Only triggered for true-scan PDFs (<24% of new)
Step 3 Pass 3	$0	OpenRouter free tier (rate-limited 2s/call)
Steps 4-7	$0	SQL + verification only
Total per monthly refresh	$0-3	scales with how many new scanned PDFs FDA released

Recovery playbook (if something breaks mid-run)

Symptom	Action
`npx supabase db push` hangs	Skip → use MCP `apply_migration` with SQL from .sql file
`Anthropic credit too low` mid-Pass-2	Run `node scripts/extract-observations-gap-fill-openrouter.cjs` to finish via free Nemotron
Vercel build fails after push	`git revert HEAD && git push` — code changes are reversible; data changes via migrations are reversible by writing the inverse SQL migration
Production shows `Invalid id`	New `record_id` prefix detected by Scan E — update regex in `src/app/api/form483s/[id]/route.ts:RECORD_ID_PATTERN`
Production shows `0 Observations` for visibly-populated PDFs	Scan A returned >30 → re-run Step 3 with `--resume`; if still failing, check if PDFs are corrupted FDA-side (mark in `form_483_field_discrepancies` as `source_misclassified`)
`download_error HTTP 404` for FDA `/media/N/download` URLs	FDA removed the file server-side. Mark in `form_483_field_discrepancies` with `suggested_action='fda_url_dead_404_unrecoverable'` so future audits skip them. Session 29 found 6 such cases (id=335, 699, 1075, 1158, 1231, 1291).
OpenRouter returns `HTTP 429 rate-limited upstream` mid-batch	Switch model via `--model` flag — different upstream providers have separate rate limits. Empirically: Gemma 31B → Nemotron 30B → Gemma 26B chain hits 100% over 3 runs.
Model returns `Cannot read properties of null (reading 'trim')`	The model's output was null/empty (often Nemotron reasoning models). Swap to a non-reasoning model (Gemma 4 26B/31B) and retry.

Related files

scripts/sync-fda-foia-483s.cjs (Step 1)
scripts/fix-fda-483s-country.cjs (Step 2)
scripts/extract-observations-gap-fill.cjs (Step 3 — primary)
scripts/extract-observations-gap-fill-openrouter.cjs (Step 3 — free fallback)
scripts/refresh-company-views.cjs (Step 5)
supabase/migrations/20260516000000_backfill_form_483s_location_from_inspections.sql
supabase/migrations/20260516000001_resync_form_483s_observation_count.sql
supabase/migrations/20260517000000_resync_observation_count_post_gapfill.sql
.claude/docs/data-pipeline-checklist.md (Step 6 reference)
src/app/api/form483s/[id]/route.ts (frontend URL pattern — update when adding new record_id prefix)

Future improvements (not blocking)

Pre-flight Anthropic balance check — script should messages.create with max_tokens=1 before starting Pass 2 to detect credit exhaustion early.
Automatic prefix detection — Scan E should console.log the new prefix value and refuse to proceed; today it just returns a count.
Cron-scheduled monthly run — wire this skill into mcp__plugin_oh-my-claudecode_t__schedule for unattended execution.
Sentry/Slack alert on gate failure — currently the gate is human-triggered; could auto-alert if A/B/C/D/E exceed thresholds.