name: cancer-buddy-organize
description: "Turn a patient's raw medical records (PDF/images/docx) into a canonical, structured patients/<patient_code>/ directory every other sub-skill can consume. Use when the user hands over a folder of medical records, or says 病历整理 (organize medical records) / 我有一堆报告 (I have a pile of reports) / 帮我整理报告 (help me organize my reports)."
license: MIT
metadata:
  author: CancerDAO
  version: "0.2.0"
  tags: medical-records ocr structuring oncology patient-data readiness
cancer-buddy-organize
Turn raw medical records into structured data every other sub-skill can use.
When to use
- User provides a folder path or set of files (PDF, JPG, PNG, DOCX, ZIP).
- User asks: 病历整理 (organize medical records) / 帮我整理这些报告 (help me organize these reports) / 我有一堆检查单 (I have a pile of test reports).
- Any other sub-skill detects a missing profile.json / readiness.json and prompts the user to run organize first.
Inputs
- Path to a folder OR a single PDF/DOCX OR a zip/rar/7z/tar.gz archive.
Outputs
Written under patients/<patient_code>/:
- INDEX.md (first line: # patient_code: <code>)
- profile.json (conforms to ../../references/patient-profile-schema.md)
- timeline.md (human-readable treatment timeline)
- readiness.json: coverage grade + review_flags[] (MTB readiness + suspicious-value audit)
- review_flags.md: auto-generated human-readable rendering of readiness.json.review_flags[] (only written when the array is non-empty)
- review_summary.md: always written; a 1-page checklist of extracted key fields with verbatim source citations, for user spot-checks (catches consistent-but-wrong OCR that review_flags can't)
- case_text.md (consolidated narrative)
- 01_当前状态/…11_诊断证明/ (raw file buckets)
- ocr/ (OCR sidecars with SOURCE/CONFIDENCE headers)
Workflow
1. Resolve input: confirm the user-supplied path with them. For archives, unpack to /tmp/cb-unpack-$$/ first (zip / rar / 7z / tar.gz / single pdf-or-docx). After unpacking, the resolved input directory ($src) is what Step 2 plans against.
2. Plan slicing (single-pass vs fan-out): glob $src for immediate subdirectories, count files, and decide slice boundaries.
MAX 15 image files per Phase 1 worker. Claude has a per-conversation total-image budget when many images are loaded into a single context. A worker that tries to OCR 25+ HEIC images in one dispatch will hit "An image in the conversation exceeds the dimension limit for many-image requests" partway through and abort with partial output. (Empirically observed: a 24-image slice failed at sidecar 5 of 24.)
Slicing rules:
- Single-pass mode: ≤ 15 files total → one Phase 1 worker
- Sub-directory fan-out: ≥ 2 subdirectories AND each subdir has ≤ 15 files → one worker per subdir
- Sub-directory fan-out with internal split: ≥ 2 subdirectories AND any subdir has > 15 files → split each oversized subdir into halves/thirds (e.g. h1_part1 / h1_part2), one worker per part. Typical case: 73 images across 3 hospitalizations of ~25 each → 6 workers (each hospitalization split into 2 halves of ~12-13 files).
- Flat fan-out: no subdirectories, > 15 files → split into N-file chunks (alphabetical or arbitrary) and name the slices batch_a / batch_b / etc.
Workers across slices run in parallel (single message, N concurrent Agent tool calls). Within a worker, files run sequentially.
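
The slicing rules above can be sketched as a small planner. This is a minimal sketch, not the skill's actual implementation: the function names (`plan_slices`, `split_even`, `batch_label`), the `"all"` slice id used for single-pass mode, and the handling of cases the rules leave open (loose top-level files next to subdirectories, a single lone subdirectory, empty subdirectories) are all assumptions made for illustration.

```python
import math
import string
from typing import Dict, List

MAX_FILES_PER_WORKER = 15  # per-worker cap from the slicing rules


def split_even(files: List[str], parts: int) -> List[List[str]]:
    """Split files into `parts` contiguous chunks whose sizes differ by at most 1."""
    base, extra = divmod(len(files), parts)
    chunks, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunks.append(files[start:start + size])
        start += size
    return chunks


def batch_label(i: int) -> str:
    """batch_a, batch_b, ...; numeric suffix past 26 batches (hypothetical fallback)."""
    if i < len(string.ascii_lowercase):
        return f"batch_{string.ascii_lowercase[i]}"
    return f"batch_{i + 1}"


def plan_slices(subdirs: Dict[str, List[str]], loose_files: List[str]) -> Dict[str, List[str]]:
    """Map slice_id -> files. `subdirs` holds each immediate subdirectory's files,
    `loose_files` the files sitting directly under $src."""
    everything = sorted(loose_files) + [f for d in sorted(subdirs) for f in sorted(subdirs[d])]
    if len(everything) <= MAX_FILES_PER_WORKER:
        return {"all": everything}  # single-pass mode: one worker

    plan: Dict[str, List[str]] = {}
    if len(subdirs) >= 2:
        for name in sorted(subdirs):
            files = sorted(subdirs[name])
            if not files:
                continue  # assumption: empty subdirectories get no worker
            parts = math.ceil(len(files) / MAX_FILES_PER_WORKER)
            if parts == 1:
                plan[name] = files  # sub-directory fan-out: one worker per subdir
            else:
                # internal split: halves, thirds, ... so every part fits the cap
                for i, part in enumerate(split_even(files, parts), start=1):
                    plan[f"{name}_part{i}"] = part
        flat = sorted(loose_files)  # assumption: loose files become flat batches
    else:
        flat = everything  # assumption: a lone subdirectory is treated as flat input

    for start in range(0, len(flat), MAX_FILES_PER_WORKER):
        label = batch_label(start // MAX_FILES_PER_WORKER)
        plan[label] = flat[start:start + MAX_FILES_PER_WORKER]
    return plan
```

For the 73-image case (three subdirectories of 25, 24 and 24 files), this planner yields six slices of 12 or 13 files each, matching the typical case described in the rules.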
3. Decide patient_code: caller-supplied OR auto-generate PT-<hex> from hash(basename + mtime). Resolve patient_data_root from $CANCER_BUDDY_PATIENTS_DIR → $VMTB_PATIENT_DATA_ROOT → $HOME/CancerDAO/patients. Compute patient_dir = <patient_data_root>/<patient_code> and mkdir -p its 11 buckets + ocr/ + 10_原始文件/.
4. Dispatch Phase 1 OCR Workers (parallel): for each slice, dispatch one general-purpose subagent, all in a single message with N tool calls (so they run concurrently, not sequentially). Each worker gets:
- subagent_type: general-purpose
- description: "Organize OCR slice <slice_id>"
- prompt: the full content of references/organizer-prompt-phase1-ocr.md, with these ## Call parameters appended at the end:
  - slice_input_path: <absolute path to the slice's source directory>
  - slice_id: <short logical label, e.g. h1, h2, batch_a>
  - patient_dir: <absolute patient_dir>
  - original_subdir: <relative path under 10_原始文件/ where audit copies go; usually the source subdir's basename>
Each Phase 1 worker writes ONLY to <patient_dir>/ocr/ (sidecars) and <patient_dir>/10_原始文件/<original_subdir>/ (audit mirror). Workers do NOT touch INDEX.md / timeline.md / profile.json / etc.; those are Phase 2's job. Workers don't share context, so anti-anchoring is structurally enforced (each worker sees only its slice, with no narrative buildup across hospitalizations).
Each worker returns:
{slice_id, files_processed, sidecars_written, stub_sidecars, full_ocr_sidecars, ocr_uncertain_files, candidates_files, continuation_needed, continuation_resume_from}.
5. Phase 1 continuation loop: for each worker that returned continuation_needed: true, dispatch a continuation worker for that slice:
"Resume Phase 1 OCR for slice <slice_id> of <patient_code>. The previous dispatch processed up to <continuation_resume_from> and stopped. Skip every file whose sidecar already exists in <patient_dir>/ocr/ (these have lower mtime than source); OCR all remaining files in <slice_input_path>. Return same JSON contract; set continuation_needed: false if done, or true with next resume point if context fills again."
Loop per slice until all slices report continuation_needed: false. Slices that finished cleanly do NOT need re-dispatch; only the laggards do. This is more efficient than re-dispatching the whole organize.
6. Dispatch Phase 2 Synthesis Worker: after every Phase 1 worker reports continuation_needed: false, dispatch a SINGLE general-purpose subagent for synthesis:
- subagent_type: general-purpose
- description: "Organize synthesis"
- prompt: the full content of references/organizer-prompt-phase2-synthesis.md, with these ## Call parameters appended:
  - patient_dir: <absolute patient_dir> (current bootstrap path; Phase 2 may rename it in Step 1.7)
  - phase1_summary: <JSON list of all Phase 1 worker results>
Phase 2 reads all sidecars (cross-slice) and classifies them into the 11 buckets using original basenames (Step 1). It then judges per-file {date, doc_type, 机构 (institution), page} plus patient-level {cancer_label, first_dx_date} from the OCR text itself and writes .rename_plan.json (Step 1.5: semantic judgment, no hardcoded vocab). Next it atomically renames physical files + sidecars and back-fills source_manifest.tsv (Step 1.6: mechanical bash, atomic with collision suffix), and renames the patient_dir itself to <cancer>_<YYYY-MM>_<hash4> when OCR yields a recognizable cancer type (Step 1.7). Only after canonical naming does Phase 2 build INDEX.md / timeline.md / case_text.md / profile.json / readiness.json, run the §4.6 review_flags audit, and write review_flags.md (if non-empty) + review_summary.md (always).
Phase 2 returns: {role, patient_dir, patient_dir_original, patient_dir_renamed, files_classified, files_renamed_canonical, files_renamed_skipped, rename_plan_path, ocr_sidecars_read, coverage_complete, missing_sidecars, readiness_grade, readiness_score, blocking_gaps, warnings, review_flags_total, review_flags_red, review_flags_yellow, review_flags_green, review_summary_path}. The patient_dir field is the post-rename path; if the caller still holds the bootstrap PT-<hex> path, use patient_dir (not the original) for any downstream operations.
7. Coverage gap retry: if Phase 2 returns
coverage_complete: false, dispatch a retry-mini-Phase1 worker with just the missing files as input, then re-run Phase 2. Loop until coverage_complete: true. Most runs converge in 0 or 1 retries.
8. Verify outputs: parse Phase 2's returned JSON; confirm profile.json exists and the required fields (patient_code, primary_cancer, histology, stage) are populated. If any are missing or null, surface that to the user as a blocker before routing to any other sub-skill.
9. Grade readiness: from Phase 2's returned JSON take readiness_grade + readiness_score. If the grade is F or D, present the information-gap checklist 🔴🟡🟢 (derived from blocking_gaps) to the patient.
10. Display review_summary.md (MANDATORY, ALWAYS): read the file at
review_summary_path and display its full content to the user. This is the first thing the user sees after organize, before the profile card and before review_flags. It is a 1-page spot-check of extracted key fields with verbatim source citations.
Why this is the first display: many real OCR errors produce internally consistent wrong values (e.g. all 7 documents in one hospitalization OCR'd to the same wrong drug name). The 5-check review_flags audit cannot detect those, but a human reading review_summary.md can spot a wrong character in 30 seconds.
After displaying, prompt the user: "请核对上面 5 个检查要点。任何字段需要修正,直接告诉我哪个字段 + 正确值,我会更新 profile.json 并重新生成清单。" (English: "Please check the 5 checkpoints above. If any field needs correcting, tell me which field and the correct value, and I will update profile.json and regenerate the checklist.")
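11. Surface review_flags (MANDATORY): if review_flags_total > 0, read review_flags.md and display its content to the user immediately after review_summary.md. This is a hard gate, not optional polish:
- If any 🔴 red flag is present: tell the user "进入下游 skill 之前请先逐条确认或 override 这些 🔴 项,它们会直接影响后续分析与推荐 (若装有 pro-skill: trial-match / mtb-lite / vmtb)" (English: "Before moving on to a downstream skill, please confirm or override each of these 🔴 items; they directly affect later analysis and recommendations (if the pro-skill is installed: trial-match / mtb-lite / vmtb)").
- If only 🟡/🟢 flags: present them as "建议核对" ("recommended to verify"); do not block downstream routing.
- If review_flags_total: 0: still tell the user "所有提取字段已通过 5 项可疑值检查 (格式/跨文档矛盾/临床逻辑/原始证据/数值趋势), 无待确认项,但仍请核对上面的 review_summary.md 速查清单" (English: "All extracted fields passed the 5 suspicious-value checks (format / cross-document contradiction / clinical logic / source evidence / numeric trend) and nothing is pending confirmation, but please still check the review_summary.md quick checklist above").
- The user's resolution per flag (accept_suggestion / keep_original / custom_value / defer) is logged back into readiness.json: review_flags[i].user_confirmed = true, plus a resolution sub-object.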
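
Logging a resolution back into readiness.json could look roughly like the sketch below. Only `review_flags[]`, `user_confirmed`, `resolution`, and the four resolution names come from the text above; the function name `record_resolution` and the keys inside the resolution sub-object (`choice`, `value`) are hypothetical, since the actual schema lives in the referenced schema document.

```python
from typing import Any, Dict, Optional

# The four resolutions named in the workflow.
RESOLUTIONS = {"accept_suggestion", "keep_original", "custom_value", "defer"}


def record_resolution(
    readiness: Dict[str, Any],
    index: int,
    choice: str,
    custom_value: Optional[str] = None,
) -> Dict[str, Any]:
    """Mark review_flags[index] as confirmed and attach a resolution sub-object.

    The sub-object layout ({"choice": ..., "value": ...}) is an assumption.
    """
    if choice not in RESOLUTIONS:
        raise ValueError(f"unknown resolution: {choice!r}")
    if choice == "custom_value" and custom_value is None:
        raise ValueError("custom_value requires the corrected value")

    flag = readiness["review_flags"][index]
    flag["user_confirmed"] = True
    flag["resolution"] = {"choice": choice}
    if custom_value is not None:
        flag["resolution"]["value"] = custom_value
    return readiness
```

The function mutates the parsed readiness.json in place and returns it, so the caller can serialize the result back to disk in one step.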
12. Output profile card: display the Patient Profile Card (references/profile-card.md) to the patient using the terminology.md format rules (中英 + 通俗解释: Chinese-English terms plus a plain-language explanation). The card's "🔍 待人工确认" (pending human confirmation) section pulls from readiness.json.review_flags[].
Downstream gate: do NOT route the user to any downstream sub-skill (education / find-care / vault, or any pro-skill analysis route if the pro-skill is installed) while any 🔴 red review_flag is unconfirmed. A wrong drug name at this stage poisons every downstream report.
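
The two blocking conditions in this workflow (required profile fields from the verify step, and unconfirmed red flags from the downstream gate) can be combined into one check. This is a minimal sketch under stated assumptions: the function name `downstream_blockers` is hypothetical, each flag is assumed to carry a `severity` key whose value is `"red"` for 🔴 items, and an empty string is treated as unpopulated. None of these details are confirmed by the text above.

```python
from typing import Any, Dict, List

# Required fields named in the "Verify outputs" step.
REQUIRED_PROFILE_FIELDS = ("patient_code", "primary_cancer", "histology", "stage")


def downstream_blockers(profile: Dict[str, Any], readiness: Dict[str, Any]) -> List[str]:
    """Return human-readable reasons why downstream routing must wait."""
    blockers = []
    for key in REQUIRED_PROFILE_FIELDS:
        # Missing, null, or (assumption) empty-string values count as unpopulated.
        if profile.get(key) in (None, ""):
            blockers.append(f"profile.{key} is missing or null")
    for i, flag in enumerate(readiness.get("review_flags", [])):
        # "severity" is an assumed key; the real schema may name it differently.
        if flag.get("severity") == "red" and not flag.get("user_confirmed", False):
            blockers.append(f"review_flags[{i}] is red and unconfirmed")
    return blockers
```

An empty list means routing may proceed; anything else should be surfaced to the user first.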
Why fan-out + reduce instead of single-pass
The original design was a single subagent processing every input file sequentially. A 73-image archive took ~33 minutes. Splitting into Phase 1 (parallel per-slice OCR) + Phase 2 (cross-slice synthesis + audit) gives three benefits:
- Speed: 3 parallel Phase-1 workers + 1 Phase-2 finishes in roughly the time of the SLOWEST slice + the synthesis pass — ~3× faster on multi-hospitalization archives in practice.
- Anti-anchoring is stronger: each Phase 1 worker only sees its slice (one hospitalization), so the narrative window the model could anchor on is shorter. Cross-slice contradictions are caught explicitly in Phase 2's §4.6 audit (which has the deterministic cross-doc check) rather than being smoothed over by a single agent's running narrative.
- Better failure isolation: if one slice's worker hits context exhaustion, only that slice retries (continuation loop). Slices that finished cleanly are not re-dispatched.
Single-pass is preserved for small inputs (≤ 15 files in total, the same per-worker cap used in the slicing rules); at that size the parallelism overhead isn't worth it.
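
The failure-isolation point can be illustrated with a small driver that re-dispatches only the slices that asked for a continuation. This is a sketch, not the skill's dispatcher: `run_phase1` and the `dispatch` callable are hypothetical stand-ins for the parallel subagent calls, threads merely stand in for concurrent tool calls, and the `max_rounds` safety bound is an added assumption (the workflow itself simply loops until every slice is done).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for one subagent dispatch: (slice_id, resume_from) -> result dict
Dispatch = Callable[[str, Optional[str]], dict]


def run_phase1(slice_ids: List[str], dispatch: Dispatch, max_rounds: int = 10) -> Dict[str, dict]:
    """Run all slices concurrently, then re-run only the slices that need continuation."""
    results: Dict[str, dict] = {}
    pending: Dict[str, Optional[str]] = {sid: None for sid in slice_ids}  # slice_id -> resume point

    for _ in range(max_rounds):
        if not pending:
            break
        with ThreadPoolExecutor(max_workers=len(pending)) as pool:
            futures = {sid: pool.submit(dispatch, sid, resume) for sid, resume in pending.items()}
            round_results = {sid: fut.result() for sid, fut in futures.items()}

        pending = {}
        for sid, res in round_results.items():
            results[sid] = res  # keeps only the latest result per slice
            if res.get("continuation_needed"):
                pending[sid] = res.get("continuation_resume_from")

    if pending:
        raise RuntimeError(f"slices still unfinished after {max_rounds} rounds: {sorted(pending)}")
    return results
```

Slices that return `continuation_needed: false` in the first round never appear in `pending` again, which is the isolation property described above.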
patient_code collision
If the generated patient_code (e.g. PT-17CE02BC33) already exists under the patients root, the subagent appends _2, _3, etc., and announces the assigned code in the summary.
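
A rough sketch of code generation plus the collision suffix follows. The workflow only says the code is derived from hash(basename + mtime); the choice of SHA-256, the 10-character uppercase hex length (modelled on the PT-17CE02BC33 example), and both function names are assumptions.

```python
import hashlib
from typing import Iterable


def generate_patient_code(basename: str, mtime: float, hex_len: int = 10) -> str:
    """Derive PT-<hex> from the input's basename and mtime (hash choice is an assumption)."""
    digest = hashlib.sha256(f"{basename}{mtime}".encode("utf-8")).hexdigest()
    return f"PT-{digest[:hex_len].upper()}"


def resolve_collision(code: str, existing: Iterable[str]) -> str:
    """Append _2, _3, ... until the code is not already taken under the patients root."""
    taken = set(existing)
    if code not in taken:
        return code
    n = 2
    while f"{code}_{n}" in taken:
        n += 1
    return f"{code}_{n}"
```

In practice `existing` would be the directory names found under the resolved patients root.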
Configurable root
The patients/ root resolves in order: $CANCER_BUDDY_PATIENTS_DIR → $VMTB_PATIENT_DATA_ROOT → $HOME/CancerDAO/patients. Override by exporting one of those. Shared with vmtb-skill.
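
The resolution order can be expressed in a few lines. In this sketch the function name `resolve_patients_root` is hypothetical, and treating an empty variable the same as an unset one is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Environment variables checked in order, per the text above.
ENV_ORDER = ("CANCER_BUDDY_PATIENTS_DIR", "VMTB_PATIENT_DATA_ROOT")


def resolve_patients_root(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the first configured root, falling back to $HOME/CancerDAO/patients."""
    env = os.environ if env is None else env
    for name in ENV_ORDER:
        value = env.get(name)
        if value:  # assumption: an empty value falls through like an unset one
            return Path(value)
    home = env.get("HOME") or str(Path.home())
    return Path(home) / "CancerDAO" / "patients"
```

Passing an explicit mapping instead of reading `os.environ` directly keeps the function easy to exercise without touching the real environment.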
Safety
Organize does not make medical recommendations. Still:
- Never fabricate fields: when a value is truly unreadable in the source, the subagent writes null (JSON) or [OCR_UNCERTAIN] (text) and surfaces it as a gap.
- Downstream sub-skills apply the full safety-guardrails.md rule set when they read what organize produced; wrong data here poisons every downstream report.
- 10_原始文件/ is the audit trail: always a byte-identical mirror of every source file.
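
The byte-identical claim is easy to spot-check with content hashes. The sketch below assumes the audit mirror preserves each source file's relative path, which the text does not state; `sha256_of` and `mirror_mismatches` are hypothetical helper names.

```python
import hashlib
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB blocks so large scans are not read into memory at once."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


def mirror_mismatches(src_root: Path, mirror_root: Path) -> List[str]:
    """List source files that are missing from, or differ in, the mirror."""
    problems = []
    for src in sorted(p for p in src_root.rglob("*") if p.is_file()):
        rel = src.relative_to(src_root)
        copy = mirror_root / rel  # assumption: mirror keeps the same relative layout
        if not copy.is_file():
            problems.append(f"missing: {rel.as_posix()}")
        elif sha256_of(src) != sha256_of(copy):
            problems.append(f"differs: {rel.as_posix()}")
    return problems
```

An empty result means every source file has a byte-identical copy at the same relative path under the mirror.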
Next-step guidance
After successful organize, route the patient to the most relevant next companion sub-skill based on their initial question:
- Newly diagnosed, wants to understand their condition → cancer-buddy-education
- Wants help finding the right hospital / specialist → cancer-buddy-find-care
- Wants to keep their organized records as a personal data vault → cancer-buddy-vault
If the user has the private pro-skill bundle installed, deeper analysis routes become available. Gate any mention of them behind "若装有 pro-skill" ("if the pro-skill is installed"):
- 若装有 pro-skill, newly diagnosed wanting maximal diagnostics → cancer-buddy-explore
- 若装有 pro-skill, has a gene report and wants treatment guidance → cancer-buddy-mtb-lite
- 若装有 pro-skill, looking for clinical trials → cancer-buddy-trial-match
Role behavior
Authoritative matrix in ../../references/roles.md. For this skill:
- Role = patient: First-person. "帮我整理我的病历" ("help me organize my medical records") → produce profile.json / timeline.md / readiness.json. The profile's data_sources[] names the patient as source.
  - Disclosure: disclosure_state=suppressed on patient entry → warn that organize will likely break suppression; proceed only with confirmation.
- Role = caregiver: Second-person. "帮你家人整理报告" ("helping organize your family member's reports"). Tone is warmer and includes a "整理这些很累吧,一步一步来"-style acknowledgment ("Sorting all this out must be tiring; let's take it one step at a time"). On the first-ever organize for this patient_code, organize creates the profile but does NOT write profile.json.caregivers[] itself. That array is owned by cancer-buddy-caregiver, which writes the caregiver's relation + name + contact preference (a documented exception in ../../references/patient-profile-schema.md). Offer to hand the user off to cancer-buddy-caregiver to record who they are.
- Role = family: Refuse. Emit: "病历整理要靠主照护者操作(Ta 手里有原件)。要不要我帮你生成一份 2 页要点让 Ta 参考?" (English: "Organizing the medical records has to be done by the primary caregiver, who holds the originals. Would you like me to generate a 2-page summary of key points for them to refer to?") Do not run organize.
References
- organizer-prompt-phase1-ocr.md — Phase 1 worker prompt: per-slice OCR, parallel-safe, sidecars-only
- organizer-prompt-phase2-synthesis.md — Phase 2 worker prompt: cross-slice synthesis + Step 1.5–1.7 canonical naming (semantic judgment + atomic bash mv) + review_flags audit + review_summary
- profile-card.md — Patient Profile Card display template
- ../../references/patient-profile-schema.md — schema contract shared with vmtb-skill
- ../../references/preflight.md — shared entry-gate (role + disclosure + readiness grade + Step 2.5 review_flags red gate + schema validity)
- ../../references/terminology.md — 中英 + 通俗解释 format
- ../../references/safety-guardrails.md