name: update-form-483s description: Idempotent end-to-end workflow to sync the mycq.ai Form 483s database from FDA FOIA Reading Room + Data Dashboard sources, extract observations from PDFs, backfill location data, and validate completeness via the 5-scan blindspot gate. Use when asked to "update form 483s", "sync 483s data", "refresh 483 observations", or as a recurring weekly/monthly maintenance job.
update-form-483s — End-to-End Database Maintenance Skill
Why this skill exists
Sessions 28L → 28-fix → 29 produced 3 iteration waves before the form_483s pipeline displayed correctly in production. Each wave only fixed the visible symptoms of the prior wave's reported bug. This skill codifies every lesson so the next refresh runs in one pass without manual debugging.
Use this skill whenever:
- New FDA FOIA xlsx is published (FDA updates monthly)
- User reports any Form 483s display issue ("0 observations", "Unknown country", "Invalid id")
- Recurring weekly/monthly database freshness audit
- Any time a script writes to
form_483s,form_483_observations,fda_inspections, orwarning_letters
Prerequisites
- Working directory:
D:\AI\创业项目\FDA\mycq.ai\regulatory-dashboard - Env vars in
.env.local:NEXT_PUBLIC_SUPABASE_URL,SUPABASE_SERVICE_ROLE_KEY,OPENROUTER_API_KEY(free), optionallyANTHROPIC_API_KEY(Pass 2 fallback) - Supabase project_id:
zzilbkehuxvbajupambt - Latest FDA FOIA xlsx in
FDA forms/最新483s.xlsx(download from https://www.fda.gov/...).
The 7-step pipeline (run in order)
Step 1 — FOIA xlsx → form_483s sync (catches new records)
node scripts/sync-fda-foia-483s.cjs --length 500 --no-refresh # MV refresh deferred to Step 5
What it does: Pulls the FOIA Reading Room HTML (DataTables-style ?length=500), parses MM/DD/YYYY dates, INSERTs missing records with record_id=fda-foia-{FEI}-{YYYYMMDD}, UPDATEs gaps in country/state/publish_date/download_url on existing rows. Uses (fei_number, record_date) as the dedup key.
Gotchas:
- Drupal Views
?page=Nsilently no-ops; only?length=Nworks. - FDA DataDashboard returns ISO alpha-2 codes (US/DE/JP), not alpha-3. The
COUNTRY_CODE_MAPmust include alpha-2 keys.
Expected outcome: 0-20 new INSERTs per monthly run; ~5-15 country/state UPDATEs on previously-incomplete rows.
Step 2 — country backfill for legacy fda-* rows
node scripts/fix-fda-483s-country.cjs --no-refresh # MV refresh deferred to Step 5
What it does: SELECTs all record_id LIKE 'fda-%' AND country IS NULL rows. For each unique FEI, POSTs to https://api-datadashboard.fda.gov/v1/inspections_classifications. Resolves country via COUNTRY_CODE_MAP and UPDATEs form_483s.
Expected outcome: 0-20 updates per run (most legacy data is already filled).
Step 3 — Run the gap-fill observation extraction (covers 100% of form_483s with download_url)
This is the architectural lesson from Session 29. The script iterates form_483s directly with LEFT JOIN form_483_observations IS NULL — NOT the form_483_foia_audit table (that was Session 28L's mistake which missed 229 PDFs).
# Pass 1 + Pass 2 (Claude Haiku vision fallback) — primary path
node scripts/extract-observations-gap-fill.cjs
# If Anthropic credit exhausted, finish with free OpenRouter Pass 3 (model chain):
node scripts/extract-observations-gap-fill-openrouter.cjs # default: gemma-4-31b
node scripts/extract-observations-gap-fill-openrouter.cjs --model "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free"
node scripts/extract-observations-gap-fill-openrouter.cjs --model "google/gemma-4-26b-a4b-it:free"
Pipeline inside the script:
- Pass 1 (free) —
pdfjs-disttext extraction + regex/(?:^|[\n\s.;:!?])(?:OBSERVATION|Observation|OBSERVATIONS|OBS\.?)\s+(\d+)\b/g. Typically covers ~76% of PDFs at $0. - Pass 2 (~$0.014/PDF) —
claude-haiku-4-5vision; triggered only when Pass 1 returns 0 ANDtotalChars < 200(true scan). $5 hard abort. - Pass 3 chain (all free) — for cases Pass 2 couldn't bill (Anthropic credit exhausted), iterate through the model fallback chain via
--modelflag. Empirically validated chain (Session 29 yielded 15/15 recoverable = 100%):google/gemma-4-31b-it:free(default, largest 31B vision) — gets 5/12 (~42%)nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free— picks up 5 more of Gemma's HTTP 429 rate-limit failures (~71% cumulative)google/gemma-4-26b-a4b-it:free— closes the long tail (Nemotron 30B null-output cases) → 100% cumulativenvidia/nemotron-nano-12b-v2-vl:free— Session 28L's original Pass 2 model; mid-tier quality (~14% solo). Use only if all 3 above are rate-limited.- Why a chain works: each model has separate upstream rate limits + different output formats. HTTP 429 from one model = healthy budget at another.
Resumability: scripts/.gap_fill_checkpoint.json tracks processed_form_483_ids + failed_form_483_ids. Re-running with --resume skips processed rows.
Step 4 — Apply observation_count + location backfill migrations
npx supabase db push
Migrations applied:
20260516000000_backfill_form_483s_location_from_inspections.sql— COALESCE city/state/country fromfda_inspectionsJOIN (8,665 cities backfilled in initial run)20260516000001_resync_form_483s_observation_count.sqland20260517000000_resync_observation_count_post_gapfill.sql— re-sync scalarobservation_countto matchCOUNT(form_483_observations)for rows where they diverge
If npx supabase db push hangs on interactive prompts, use MCP fallback:
mcp__claude_ai_Supabase__apply_migration with the SQL from each .sql file
Step 5 — Refresh materialized views (mandatory after any form_483s write)
Per Session 23 lesson — company_stats and company_facility_links are MATERIALIZED VIEWs. Reading them is fast (~138ms) but recomputing takes ~18s, so writes don't trigger automatic refresh.
node scripts/refresh-company-views.cjs
If local PG env not set (Session 28k discovery):
mcp__claude_ai_Supabase__execute_sql: SELECT refresh_company_views();
Step 6 — Run the 5-scan blindspot gate (FORCING FUNCTION)
This is the completion checkpoint. Do NOT report "update complete" until all 5 scans show ✅. See full SQL in .claude/docs/data-pipeline-checklist.md.
Quick version — all 5 scans in one go:
SELECT
-- A. has download_url + extraction NEVER ran (Session 29 root cause)
(SELECT COUNT(*) FROM form_483s f
LEFT JOIN (SELECT DISTINCT form_483_id FROM form_483_observations) o ON o.form_483_id=f.id
WHERE f.download_url IS NOT NULL AND o.form_483_id IS NULL AND f.record_date IS NOT NULL) AS A_never_extracted,
-- B1. scalar > 0 + 0 detail rows + has download_url (extractable gap — should approach 0)
(SELECT COUNT(*) FROM form_483s f
LEFT JOIN (SELECT form_483_id FROM form_483_observations GROUP BY form_483_id) o ON o.form_483_id=f.id
WHERE f.observation_count > 0 AND o.form_483_id IS NULL AND f.download_url IS NOT NULL AND f.record_date IS NOT NULL) AS B1_extractable_gap,
-- B2. scalar > 0 + 0 detail rows + NO download_url (structural; can't fix without acquiring PDFs)
(SELECT COUNT(*) FROM form_483s f
LEFT JOIN (SELECT form_483_id FROM form_483_observations GROUP BY form_483_id) o ON o.form_483_id=f.id
WHERE f.observation_count > 0 AND o.form_483_id IS NULL AND f.download_url IS NULL AND f.record_date IS NOT NULL) AS B2_structural_gap,
-- C. scalar disagrees with extracted COUNT
(SELECT COUNT(*) FROM form_483s f
JOIN (SELECT form_483_id, COUNT(*) AS c FROM form_483_observations GROUP BY form_483_id) o ON o.form_483_id=f.id
WHERE f.observation_count IS DISTINCT FROM o.c) AS C_scalar_drift,
-- D. FEI exists in fda_inspections but form_483s.city IS NULL
(SELECT COUNT(*) FROM form_483s f
JOIN (SELECT DISTINCT ON (fei_number) fei_number, city FROM fda_inspections WHERE fei_number IS NOT NULL ORDER BY fei_number, inspection_end_date DESC NULLS LAST) i ON i.fei_number=f.fei_number
WHERE f.city IS NULL AND i.city IS NOT NULL) AS D_location_gap,
-- E. record_id prefix family inconsistency
(SELECT COUNT(*) FROM form_483s WHERE record_id NOT LIKE 'ft-%' AND record_id NOT LIKE 'fda-foia-%' AND record_id NOT LIKE 'FDA-%' AND record_id NOT LIKE 'fda-%') AS E_unknown_prefix;
Gate criteria:
- A ≤ 30 (only true Pass 1+2+3 failures remain — scanned PDFs with no recoverable text)
- B1 ≤ 50 (extractable gap; re-run Step 3 with
--resumeif higher) - B2 is informational only — represents
form_483srows where fdatracker scraped a count but we have no PDF. Not "fixable" via extraction; flag in UI as "full text unavailable" or backfilldownload_urlfrom new source. - C = 0
- D ≤ 270 (only FEIs absent from fda_inspections — unreachable via JOIN)
- E = 0 (new prefix detected → frontend
[id]/route.tsregex MUST be updated to match)
If any criterion fails, iterate until it passes before reporting completion. Don't repeat the Session 28L→29 wave-by-wave fix pattern.
Step 7 — Production verification
After successful gate pass:
npx tsc --noEmit && npx next build— confirm 0 TS errors, build cleangit add scripts/ supabase/migrations/ && git commit -m "..." && git push origin main— Vercel auto-deploys- Smoke-test in browser at
mycq.ai/fda-tracker/form483s:- List page shows real cities (not
—), real observation counts (not 0 for FEI-bearing rows) - Click into Wells Pharmacy / SCA Pharmaceuticals / Fujifilm Diosynth Texas → detail page renders with Observations tab populated
- URL
mycq.ai/fda-tracker/form483s/fda-foia-3013736415-20260313returns 200 (not "Invalid id")
- List page shows real cities (not
Idempotency guarantees
- Step 1 (sync) —
(fei_number, record_date)dedup key prevents duplicate INSERTs. Safe to re-run hourly. - Step 3 (extraction) — checkpoint at
scripts/.gap_fill_checkpoint.jsonprevents reprocessing. UPSERTsform_483_observationswithonConflict: form_483_id,observation_number. - Step 4 (migrations) — both use
IS DISTINCT FROM/COALESCEso re-applying is a no-op. - Step 5 (MV refresh) —
CONCURRENTLYrefresh; reads stay served during compute. - Step 6 (gate) — read-only.
Anti-patterns (don't repeat these)
| Mistake | Lesson | Where it bit us |
|---|---|---|
| Iterate from a JOIN/audit table instead of the source-of-truth table | Always iterate form_483s for "all PDFs"; never trust a derived view as the universe |
Session 28L missed 229 PDFs |
| INNER JOIN verification SQL | Use LEFT JOIN ... IS NULL to find "never processed", not JOIN to validate "processed quality" |
Session 28-fix's verification masked the 229-row gap |
| Skip MV refresh after write | company_stats is MATERIALIZED → reads return stale until refresh_company_views() runs |
Session 23: HTTP 500 from 25.5s recompute |
| Pay for vision before trying free text extraction | pdfjs+regex handles ~76% of FDA Form 483 PDFs at $0 | Session 28L spent $17.65 when ~$4 would have sufficed |
Add new record_id prefix without updating [id]/route.ts regex |
The frontend URL parser rejects unknown prefixes as Invalid id |
Session 28-fix: 1,964 detail pages broken until route fixed |
Use npm canvas and @napi-rs/canvas interchangeably |
@napi-rs/canvas hangs on page.render() with pdfjs v3; must use canvas (npm) |
Session 28L lost ~1 hour to this |
pdfjs-dist v5+ ESM-only |
Pin pdfjs-dist@^3.11 for CJS require() compat |
Session 28L lost ~30 min |
Pass Node Buffer directly to pdfjs |
pdfjs checks constructor name; must wrap: new Uint8Array(buf.buffer, buf.byteOffset, buf.byteLength) |
Session 28L lost ~30 min |
Cost expectations per run
| Step | Cost | Notes |
|---|---|---|
| Step 1 sync | $0 | HTTP scrape, no LLM |
| Step 2 country | $0 | FDA API, no key needed |
| Step 3 Pass 1 | $0 | pdfjs+regex local |
| Step 3 Pass 2 | $0-3 | Only triggered for true-scan PDFs (<24% of new) |
| Step 3 Pass 3 | $0 | OpenRouter free tier (rate-limited 2s/call) |
| Steps 4-7 | $0 | SQL + verification only |
| Total per monthly refresh | $0-3 | scales with how many new scanned PDFs FDA released |
Recovery playbook (if something breaks mid-run)
| Symptom | Action |
|---|---|
npx supabase db push hangs |
Skip → use MCP apply_migration with SQL from .sql file |
Anthropic credit too low mid-Pass-2 |
Run node scripts/extract-observations-gap-fill-openrouter.cjs to finish via free Nemotron |
| Vercel build fails after push | git revert HEAD && git push — code changes are reversible; data changes via migrations are reversible by writing the inverse SQL migration |
Production shows Invalid id |
New record_id prefix detected by Scan E — update regex in src/app/api/form483s/[id]/route.ts:RECORD_ID_PATTERN |
Production shows 0 Observations for visibly-populated PDFs |
Scan A returned >30 → re-run Step 3 with --resume; if still failing, check if PDFs are corrupted FDA-side (mark in form_483_field_discrepancies as source_misclassified) |
download_error HTTP 404 for FDA /media/N/download URLs |
FDA removed the file server-side. Mark in form_483_field_discrepancies with suggested_action='fda_url_dead_404_unrecoverable' so future audits skip them. Session 29 found 6 such cases (id=335, 699, 1075, 1158, 1231, 1291). |
OpenRouter returns HTTP 429 rate-limited upstream mid-batch |
Switch model via --model flag — different upstream providers have separate rate limits. Empirically: Gemma 31B → Nemotron 30B → Gemma 26B chain hits 100% over 3 runs. |
Model returns Cannot read properties of null (reading 'trim') |
The model's output was null/empty (often Nemotron reasoning models). Swap to a non-reasoning model (Gemma 4 26B/31B) and retry. |
Related files
scripts/sync-fda-foia-483s.cjs(Step 1)scripts/fix-fda-483s-country.cjs(Step 2)scripts/extract-observations-gap-fill.cjs(Step 3 — primary)scripts/extract-observations-gap-fill-openrouter.cjs(Step 3 — free fallback)scripts/refresh-company-views.cjs(Step 5)supabase/migrations/20260516000000_backfill_form_483s_location_from_inspections.sqlsupabase/migrations/20260516000001_resync_form_483s_observation_count.sqlsupabase/migrations/20260517000000_resync_observation_count_post_gapfill.sql.claude/docs/data-pipeline-checklist.md(Step 6 reference)src/app/api/form483s/[id]/route.ts(frontend URL pattern — update when adding new record_id prefix)
Future improvements (not blocking)
- Pre-flight Anthropic balance check — script should
messages.createwithmax_tokens=1before starting Pass 2 to detect credit exhaustion early. - Automatic prefix detection — Scan E should
console.logthe new prefix value and refuse to proceed; today it just returns a count. - Cron-scheduled monthly run — wire this skill into
mcp__plugin_oh-my-claudecode_t__schedulefor unattended execution. - Sentry/Slack alert on gate failure — currently the gate is human-triggered; could auto-alert if A/B/C/D/E exceed thresholds.