m62-altitude-onboarding

name: m62-altitude-onboarding
description: "Extract entity data from household document folders (PDFs, Word docs, images, spreadsheets) and update the Altitude (Altcore) platform via API. Queries Altitude first to find existing households and their universe of entities (Individuals, LegalEntities, AccountFinancials, Contacts, TangibleAssets, Households), extracts data from documents, matches and merges against existing records (filling empty fields, flagging conflicts), creates relationships, and uploads documents to the correct entity. Use this skill whenever the user mentions Altitude, Altcore, onboarding families, extracting entity data from documents, updating households, processing client folders, or uploading documents. Also trigger when the user has a folder of family documents (trusts, LLCs, tax returns, IDs, insurance, estate plans, bank statements) and wants to populate a wealth management platform."

Altitude Document Extraction & Entity Update

⛔ CRITICAL RULE: You MUST read EVERY SINGLE FILE in the household folder. Not most files. Not the important-looking files. ALL files. Write a file tracker (altitude_review/file_tracker.md) listing every file. Mark each READ as you go. Do NOT proceed to Phase 4 until the tracker shows 100% READ. If you read 22 out of 60 files, you have failed. This is the #1 cause of extraction failure — see "Zero-Skip Rule" in Phase 3.

This skill extracts entity data from household document folders and updates the Altitude platform. It follows a query-first, match-and-merge approach — never blindly creating entities. Every change is reviewed before pushing.

⛔ CRITICAL RULE: NEVER INVENT VALUES. If a field is not stated in the source documents, set it to null/omit it — do NOT default, estimate, round, or substitute a placeholder.

This is non-negotiable. A null field correctly signals "data missing — needs RM follow-up". An invented value ($1,000,000, "UNKNOWN", 2024-01-01, "Anytown, USA") masquerades as authoritative data and silently corrupts every downstream report, valuation rollup, risk analysis, and MCP query that consumes it.

Verita 2026-05-04 audit found 3 Levine real-property TAs with insuredValue: 1000000.0 on assets worth $7.7M to $17M — placeholder values invented when the source PDFs didn't include actual coverage amounts. The MCP downstream consumer treated them as real and built risk analysis on top. Same root pattern: primaryInsurancePolicyNumber: "UNKNOWN" as a literal string instead of null.

The mandate, applied to every entity field on every POST/PATCH the skill emits:

Source-document state	Field value to send	Do NOT do
Field stated explicitly	The exact stated value	—
Field stated as range / approx	Lower bound + a note explaining the range	Don't pick the midpoint
Field implied but not stated	`null` + add an open-question	Don't infer from context
Field absent from source	`null`/omit from POST	Don't default to $0, $1M, "UNKNOWN", "N/A", today's date, etc.
Field unreadable (cloud-stub, OCR fail)	`null` + flag in `file_tracker`	Don't guess

Particularly forbidden placeholder patterns (these are all real bugs we've cleaned up):

1000000 / 1000 / 100000 as round-number fallbacks for insuredValue, coverageAmount, currentValue

"UNKNOWN", "N/A", "TBD", "Pending" as literal string values (use null)

2024-01-01 / 2025-01-01 / today's date as fallback for effectiveDate, purchaseDate, valuationDate, dateOfBirth

VINs ending in 123456 or other obviously-synthetic patterns

SSN 000-00-0000 / 111-11-1111

"Anytown, USA" / "123 Main St" style address placeholders

0.0 for percentage on an OWNERSHIP edge when the real percentage is unknown (omit the edge entirely or use economicOwnership: false)

When in doubt, leave it null and add an entry to altitude_review/open_questions.json so the RM has a tracked follow-up. A field with null and a documented question is recoverable. A field with an invented value is a silent data-poisoning event that may go undetected for months.
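The forbidden patterns listed above can be screened mechanically before any POST/PATCH is sent. The following is a minimal sketch of such a pre-flight check, not part of the Altitude API: `find_placeholder_values` and the constant names are hypothetical, and the lists cover only the patterns named in this section. A hit means "stop and confirm against the source document", since a genuinely stated $1,000,000 would also be flagged.

```python
import re

# Hypothetical pre-flight screen; the patterns mirror the forbidden list above.
PLACEHOLDER_STRINGS = {"unknown", "n/a", "tbd", "pending"}
ROUND_NUMBER_FIELDS = {"insuredValue", "coverageAmount", "currentValue"}
ROUND_NUMBERS = {1000, 100000, 1000000}
FAKE_SSNS = {"000-00-0000", "111-11-1111"}
ADDRESS_PLACEHOLDER = re.compile(r"anytown|123 main st", re.IGNORECASE)


def find_placeholder_values(payload: dict) -> list:
    """Return the names of fields whose values look like invented placeholders."""
    suspects = []
    for field, value in payload.items():
        if isinstance(value, str):
            text = value.strip()
            if (text.lower() in PLACEHOLDER_STRINGS
                    or text in FAKE_SSNS
                    or ADDRESS_PLACEHOLDER.search(text)):
                suspects.append(field)
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            # Round numbers are only suspicious on the value-type fields.
            if field in ROUND_NUMBER_FIELDS and value in ROUND_NUMBERS:
                suspects.append(field)
    return suspects
```

Fields reported by the check should be verified against the source and, if unsupported, set to null with an entry in altitude_review/open_questions.json.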

⛔ COROLLARY: NEVER COPY VALUES BETWEEN FIELDS WITH DIFFERENT SEMANTICS.

A value that's correct on entity X with field A may be wrong on entity Y with field B even though the number is the same. The schema's field names and units are part of the contract — respect them. When unsure, leave the target field null and document.

Real bugs we've cleaned up:

Source	Wrong target	Why it's wrong
Auto policy `coverageAmount` ($500K combined-single-limit liability)	each covered vehicle's `insuredValue`	`insuredValue` = per-asset stated/replacement cost. Liability cap is policy-level, not asset-level. Putting $500K on a Cadillac Escalade ($87K market value) implies absurd over-insurance.
Trust `marketValue` of underlying assets	Trust LE's `currentValue`	Trust value rolls up from owned assets via the OWNERSHIP graph, not via a flat field on the trust itself.
Insurance policy `effectiveDate`	Asset's `purchaseDate`	"When did coverage start" ≠ "when did you buy this".
Mortgage `originationDate`	Property's `acquisitionDate`	A property can be re-mortgaged or have multiple mortgages over its life.

Decision test before assigning a number to a field: Read the field's description in the schema. State out loud what the field means. State what the source value means. If the two definitions don't match exactly — including units, scope (per-asset vs per-policy), and timing — leave the target null and add an open-question.

Decision test for "the doc has a number near this entity": Just because a number appears in a document attached to entity X does NOT mean that number belongs in any of entity X's fields. The number may be:

A line-item detail from a related but distinct entity

A historical value (closing price 5 years ago, not current value)

A net-of-something calculation (NAV after debt) that doesn't match the requested field (gross asset value)

A scope-level value (policy-wide premium ÷ N covered items ≠ per-item premium)

When the doc has the right entity but wrong field, capture it in the related entity's correct field and leave the requested field null. Never coerce a known wrong-field value into the right-field slot.

Life-event modes (detect early, branch appropriately)

Before anything else, sniff the folder for life-event signals that change how this flow should behave. If any of these markers are present, add a banner to the review and branch.

Signal in folder	Mode	Implications
`divorce`, `MSA`, `Judgment`, `schedule of assets`, `FL-150`, `<client> divorce documents`	Divorce / post-divorce	Expect HISTORICAL spouse, joint-trust division, community property allocation. Expect the final MSA/Judgment to be authoritative — if missing, defer ownership decisions. Treat ex-spouse as Contact (optionally Individual for historical SPOUSE).
`prenuptial`, `transmutation`, `Cal Fam Code §852`	CP transmutation	All pre-marital separate property may now be community. Don't assume pre-2025 ownership carries forward.
`estate of`, `probate`, `letters testamentary`	Post-death	Primary individual may be deceased. Use `lifecycleStatus=DECEASED` and `dateOfDeath`. Estate is the active entity, not the individual.
`prospect`, unsigned client agreement	Pre-engagement	Record client-since date. Anything dated before the signed client agreement is prospect data.
Folder from partner firm (e.g. Share) with `USE THESE PER ...` or `LATEST` or `FINAL` directory/filename prefixes	Authority markers	Prefer files in authority-marked folders over other sources when resolving conflicts.

For the Divorce mode specifically: (a) Every joint-titled asset belongs in Tier B — Pending MSA until the final judgment specifies allocation. Don't create joint-trust accounts under the client's Household without flagging. (b) The ex-spouse is NOT a client. Create as a Contact with role "Former spouse / counterparty" OR as an Individual with SPOUSE relationship marked HISTORICAL (set effectiveTo to the decree date once known). (c) Family trusts that existed before the divorce are typically being divided — track them as HISTORICAL and create the NEW post-divorce trusts as current entities.
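The folder sniff described above can be sketched as a filename scan. This is an illustration only: the mode labels and the `detect_life_event_modes` name are hypothetical, the patterns are lifted from the signal table, and the authority-marker and unsigned-agreement signals are left out because they need more than a filename match.

```python
import re

# Signal patterns come from the table above; mode labels are illustrative.
# MSA and "prospect" use letter lookarounds so that "Smith_MSA_final.pdf"
# matches while "Prospectus" does not.
LIFE_EVENT_SIGNALS = {
    "DIVORCE": re.compile(
        r"divorce|(?<![a-z])MSA(?![a-z])|judgment|schedule of assets|FL-150",
        re.IGNORECASE),
    "CP_TRANSMUTATION": re.compile(
        r"prenuptial|transmutation|Cal Fam Code", re.IGNORECASE),
    "POST_DEATH": re.compile(
        r"estate of|probate|letters testamentary", re.IGNORECASE),
    "PRE_ENGAGEMENT": re.compile(
        r"(?<![a-z])prospect(?![a-z])", re.IGNORECASE),
}


def detect_life_event_modes(paths):
    """Return the set of life-event modes signalled by any path in the folder."""
    modes = set()
    for path in paths:
        for mode, pattern in LIFE_EVENT_SIGNALS.items():
            if pattern.search(path):
                modes.add(mode)
    return modes
```

A non-empty result is a prompt to add the review banner and branch; it does not replace reading the files themselves.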

Fund-entity flood (aggregate vs create individually)

When the client's household includes an operating partner at a VC/PE firm, or any limited partner in 50+ investment vehicles (Accel, Sequoia, KKR, Accel-India, IDG-Accel China, etc.), you will extract hundreds of partnership LegalEntities (Accel XVI Investors LLC, Accel London VII LP, etc.). These are typically already tracked in Addepar at the position level.

Default rule: DO NOT create individual LegalEntity records for investment fund vehicles. Instead:

Track aggregate exposure as supplemental attributes on the parent trust or account ("Total Accel carry: $120M unrealized, $40M side-funds")
If the client wants entity-level tracking, create a single umbrella LegalEntity (e.g. "Accel Carry — the household's Trust") with supplemental attributes listing the component funds
Full fund list stays in the extraction cache (altitude_review/extraction_cache.jsonl) for audit / future refinement

Absence-as-data: empty folders and missing documents

Wealth management firms often use standardized folder templates. When you encounter empty folders (Insurance/, Investments/, Client Reporting/, Miscellaneous/), treat them as absence signals rather than ignoring them:

Empty Insurance/ → "No insurance documents collected yet" → open question: does client have policies we need to request?
Empty Investments/ → "No investment docs" → is this because investments are via a separate custodian, or because we haven't collected them?
Empty Client Reporting/ → "Client may be pre-engagement" → confirm client-since date

Record absence facts in altitude_review/open_questions.json and in the review; do not drop them.
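As a rough sketch of the absence rule, the helper below walks the top level of a household folder and turns each empty directory into an open-question entry. The `absence_questions` name is hypothetical, and the entry shape (category, blocking, entity, question) is an assumption borrowed from the open-question example used elsewhere in this skill.

```python
from pathlib import Path

# Questions for the template folders named above; other empty folders get a
# generic question.
ABSENCE_QUESTIONS = {
    "Insurance": "No insurance documents collected yet: does the client "
                 "have policies we need to request?",
    "Investments": "No investment docs: are investments held via a separate "
                   "custodian, or have we not collected them?",
    "Client Reporting": "Client may be pre-engagement: confirm client-since date.",
}


def absence_questions(household_dir):
    """Build one open-question entry per empty top-level folder."""
    questions = []
    for child in sorted(Path(household_dir).iterdir()):
        if child.is_dir() and not any(child.iterdir()):
            question = ABSENCE_QUESTIONS.get(
                child.name,
                f"Folder '{child.name}' is empty: confirm whether documents are expected.",
            )
            questions.append({
                "category": "missing_data",
                "blocking": False,
                "entity": child.name,
                "question": question,
            })
    return questions
```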

Addepar-sync provenance (do not clobber synced fields)

Any entity whose externalIds includes provider: ADDEPAR is populated by nightly Addepar sync. PATCHing Addepar-owned fields risks having your changes overwritten on the next sync.

Rule of thumb before PATCHing:

Check externalIds on the entity — if provider=ADDEPAR exists, the Addepar sync owns:
- Account: accountNumber, custodianId, accountCategory, provider-side balances, position data
- Individual: synced name + DOB from the Addepar "party" record
- Household: the Addepar hierarchy name
Only PATCH fields that are NOT owned by Addepar (estatePlanning, email, phone, supplementalAttributeValues, Altitude-side metadata)
If you must PATCH a synced field, leave a note in the review flagging the sync conflict risk
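One way to apply this rule of thumb is to split a proposed PATCH body into fields that are safe to send and fields that carry sync-conflict risk. The sketch below is hypothetical: `split_patch` is not an Altitude function, the shape of `externalIds` (a list of objects with a `provider` key) is assumed, and only the ACCOUNT field names come from the list above. The INDIVIDUAL and HOUSEHOLD field names are guesses, since the text only says "synced name + DOB" and "hierarchy name".

```python
# ACCOUNT names are from the list above; INDIVIDUAL and HOUSEHOLD names are
# assumed for illustration and should be checked against the real schema.
ADDEPAR_OWNED_FIELDS = {
    "ACCOUNT": {"accountNumber", "custodianId", "accountCategory"},
    "INDIVIDUAL": {"firstName", "lastName", "dateOfBirth"},
    "HOUSEHOLD": {"name"},
}


def split_patch(entity_type, entity, patch):
    """Split a PATCH body into (safe, sync_risk) field dicts."""
    external_ids = entity.get("externalIds") or []
    synced = any(ext.get("provider") == "ADDEPAR" for ext in external_ids)
    # Nothing is Addepar-owned unless the entity is actually synced.
    owned = ADDEPAR_OWNED_FIELDS.get(entity_type, set()) if synced else set()
    safe = {k: v for k, v in patch.items() if k not in owned}
    sync_risk = {k: v for k, v in patch.items() if k in owned}
    return safe, sync_risk
```

Anything that lands in `sync_risk` is a candidate for the review note about sync conflicts rather than a silent PATCH.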

Auto-HISTORICAL SPOUSE when divorce signals are present

If any file in the folder matches the Divorce / post-divorce life-event signals (MSA, Judgment, schedule of assets, FL-150, protective-order stipulation, "divorce decree", etc.), the SPOUSE relationship MUST be created HISTORICAL, not current:

{
  "relationshipType": "SPOUSE",
  "sourceEntityType": "INDIVIDUAL", "sourceEntityId": "<clientA>",
  "targetEntityType": "INDIVIDUAL", "targetEntityId": "<clientB>",
  "effectiveTo": "<decree date | stipulation date | best available divorce-milestone date>",
  "role": "Former spouse (divorced <date> per <source doc>)"
}

Priority for effectiveTo:

Final MSA / Judgment of Dissolution date (if in folder)
Court-filed stipulation date (e.g. "Stip re Protective Order [F.MM.DD.YY]")
Date of earliest divorce filing visible in the folder
If none available, create the SPOUSE WITHOUT effectiveTo but flag in Open Questions "No divorce decree date found — SPOUSE marked current until MSA is produced"

DO NOT create the SPOUSE as current with role: "Spouse (separation pending divorce)" — the frontend renders it as "married" regardless of role. Use effectiveTo to make it HISTORICAL. If unknown, either omit the relationship entirely or flag the date gap.
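The effectiveTo priority order can be expressed as a simple first-match lookup. This is a sketch under the assumption that the three candidate dates have already been extracted as ISO strings (or None); `pick_effective_to` and its parameter names are hypothetical.

```python
def pick_effective_to(judgment_date=None, stipulation_date=None,
                      earliest_filing_date=None):
    """Return (date, source_label) for the best available divorce milestone.

    Returns (None, None) when no date is known, in which case the caller
    should file an open question instead of inventing a date.
    """
    candidates = (
        ("Final MSA / Judgment of Dissolution", judgment_date),
        ("Court-filed stipulation", stipulation_date),
        ("Earliest divorce filing", earliest_filing_date),
    )
    for source, value in candidates:
        if value:
            return value, source
    return None, None
```

The returned source label can be carried into the edge's role text so the review shows which document supplied the date.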

Caveat — field-nulling limitation: Spring merge-patch ignores null values on entity-relationship PATCH, so setting effectiveFrom to null after creation does NOT work. Always set effectiveFrom correctly (or OMIT it) AT CREATION time. For SPOUSE where the marriage date is unknown, omit effectiveFrom at POST time rather than defaulting to "today".

Invariant — temporal ordering (R-W8 amendment, Comolli rerun 2026-04-29): effectiveFrom <= effectiveTo MUST hold on every SPOUSE/MARRIAGE-related edge before POST. Comolli's R-W8 audit found a SPOUSE edge with effectiveFrom=2026-04-20 (data-entry date) and effectiveTo=2026-01-29 (decree date) — effectiveTo < effectiveFrom, an inverted temporal pair. This happens when the agent uses today's date as the marriage date instead of OMITTING effectiveFrom per the rule above.

Pre-POST check the agent MUST run on any SPOUSE/MARRIAGE/PARTNER edge:

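def check_spouse_edge_dates(effective_from, effective_to):
    # ISO-8601 date strings (YYYY-MM-DD) sort chronologically, so plain string
    # comparison is enough; None or empty values skip the check.
    if effective_from and effective_to and effective_from > effective_to:
        raise ValueError(
            f"SPOUSE edge has effectiveFrom={effective_from} > effectiveTo={effective_to}; "
            "marriage date must precede decree date. If marriage date is unknown, "
            "OMIT effectiveFrom at POST time."
        )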

If the inverted edge already exists in production (per Comolli), the remediation is PATCH /entity-relationship/{id}/attributes to clear effectiveFrom (best-effort if Spring merge-patch ignores it) OR hard-delete + re-POST with the corrected ordering — preserve the audit trail by setting the correct dates rather than null-ing.

Always create PARENT/CHILD edges for household children

The Household→Individual OWNERSHIP relationship (skill default) establishes MEMBERSHIP but NOT family structure. The frontend's family tree, estate plan chart, and beneficiary flowchart ALL depend on PARENT (and inverse CHILD) edges. After creating the Household OWNERSHIP edges, also create:

For each minor-or-adult child in the household with at least one identified parent:

{
  "relationshipType": "PARENT",
  "sourceEntityType": "INDIVIDUAL", "sourceEntityId": "<parent-individual>",
  "targetEntityType": "INDIVIDUAL", "targetEntityId": "<child-individual>",
  "effectiveFrom": "<child's DOB>",
  "role": "Biological parent"  // or "Adoptive parent" ONLY — see step-parent rule below
}

A PARENT edge means biological or legal (adoptive) parent — NEVER a step-parent. The frontend now derives full-vs-half sibling status by counting how many PARENT edges two children share (see the SIBLING rule below). Wiring a step-parent as a PARENT edge silently inflates that count and mislabels half-siblings as full (and a step-mother as "Mother"). Step-parents are handled separately — see "Step-parents" below.

Cardinality: PARENT has maxCardinality: 2 on the target — a child can have max 2 PARENT edges. Create one per identified biological/legal parent. If only one parent is identified (e.g. the non-client parent is unknown, deceased, or simply not named in any document), create just the one edge — do NOT invent or guess the second parent to "complete the pair." A child legitimately having only one known parent is the correct representation of a half-sibling or single-documented-parent case.

Include the ex-spouse as a parent: In post-divorce cases where the ex-spouse is the children's other parent, the ex-spouse needs to be an Individual (not just a Contact) to serve as the PARENT source. Always create the ex-spouse as an Individual with HISTORICAL SPOUSE, THEN create PARENT edges from BOTH parents to each child.

Inverse CHILD edges: the EntityRelationshipType enum maps PARENT ↔ CHILD as inverse reciprocals (see getInverseType()). Depending on the backend version, creating PARENT may or may not auto-create CHILD. Verify after creation; if missing, create CHILD explicitly.
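The verify-then-create step for inverse edges might look like the sketch below. It assumes the relationship list has already been fetched as dicts in the same shape as the POST bodies above, and that the CHILD inverse simply swaps source and target; `missing_child_edges` is a hypothetical helper, not a platform call.

```python
def missing_child_edges(edges):
    """Return CHILD edge bodies for every PARENT edge lacking its inverse."""
    existing = {
        (e["sourceEntityId"], e["targetEntityId"])
        for e in edges
        if e["relationshipType"] == "CHILD"
    }
    missing = []
    for e in edges:
        if e["relationshipType"] != "PARENT":
            continue
        # Assumed inverse direction: child is the source, parent the target.
        child_id, parent_id = e["targetEntityId"], e["sourceEntityId"]
        if (child_id, parent_id) not in existing:
            missing.append({
                "relationshipType": "CHILD",
                "sourceEntityType": "INDIVIDUAL", "sourceEntityId": child_id,
                "targetEntityType": "INDIVIDUAL", "targetEntityId": parent_id,
            })
    return missing
```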

SIBLING edges — full vs half (do NOT fabricate shared parents)

Create a SIBLING edge (symmetric) between two household members who are siblings. The concrete enum is SIBLING; use the role field for the human label:

{
  "relationshipType": "SIBLING",
  "sourceEntityType": "INDIVIDUAL", "sourceEntityId": "<sibling-a>",
  "targetEntityType": "INDIVIDUAL", "targetEntityId": "<sibling-b>",
  "role": "Brother"   // or "Sister" / "Half-brother" / "Half-sister"
}

Full vs half is determined by shared biological parents, and the frontend computes it automatically from the PARENT edges — it counts how many parents the two children share:

shares both parents → full sibling (Brother / Sister)
shares exactly one parent → half sibling (Half-brother / Half-sister)

Because the FE derives this from PARENT edges, the integrity of the half/full distinction depends entirely on PARENT edges being accurate. Therefore:

NEVER add a PARENT edge to a parent that a document does not actually attest just to make two children look like full siblings. If the documents only establish that Sam and Liza are children of the father (and never name a shared mother with the principal), they are half-siblings — leave them with the single PARENT edge.
Set the SIBLING role to match the documented reality (Half-brother/Half-sister when only one parent is shared). If unsure whether full or half, set role: "Sibling" and file an Open Question rather than guessing "Brother/Sister".
Watch for near-homograph name traps when siblings and step-parents coexist (e.g. "Liza" the half-sister vs "Lisa" the step-mother). Confirm by ID, not by name string.

Step-parents — SPOUSE-of-parent, never a PARENT edge

A step-parent is the current spouse of one of the child's biological parents who is NOT themselves a biological/adoptive parent of that child. Represent them as:

A SPOUSE edge between the step-parent and the biological parent they are married to (this is what the FE reads to label them "Step-mother"/"Step-father" relative to the household principal), and
(optional, if useful for search) a FAMILY edge from the step-parent to the child with role: "Step-parent".

Do NOT create a PARENT edge from a step-parent to a step-child unless there is a legal adoption on record (then it is an Adoptive parent PARENT edge, and the child genuinely has that person as a counted parent). A step-parent wired as PARENT will make the FE render them as a biological parent and will corrupt every full/half-sibling label among that parent's children.

Worked example (the canonical remarried-parent shape): Principal P's father F remarried SM (step-mother) after divorcing P's mother M. P and full-sibling have PARENT edges to both M and F. P's paternal half-siblings have a PARENT edge to F only (their mother is not M and is typically not in the dataset). SM gets a SPOUSE edge to F and no PARENT edge to anyone. Result the FE renders from P's perspective: F = Father, M = Mother, SM = Step-mother, full-sib = Brother/Sister, paternal sibs = Half-brother/Half-sister. Do not "fix" the half-siblings' missing second parent — it is correct.
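The worked example can be checked with a small model of the shared-parent count. This is only an illustration of the counting rule as described in this document, not the frontend's actual code; `shared_parent_label` and the short ids are hypothetical.

```python
def shared_parent_label(parent_edges, child_a, child_b):
    """Classify two children by how many PARENT edges they share.

    parent_edges is a list of (parent_id, child_id) pairs, one per PARENT edge.
    """
    parents_a = {p for p, c in parent_edges if c == child_a}
    parents_b = {p for p, c in parent_edges if c == child_b}
    shared = len(parents_a & parents_b)
    if shared >= 2:
        return "full"
    if shared == 1:
        return "half"
    return "none"


# Remarried-parent shape: P and FullSib have M and F; HalfSib has F only.
# SM (step-mother) has a SPOUSE edge to F and therefore no entry here.
PARENT_EDGES = [
    ("M", "P"), ("F", "P"),
    ("M", "FullSib"), ("F", "FullSib"),
    ("F", "HalfSib"),
]
```

With these edges, P and FullSib share two parents and P and HalfSib share one, which is the labelling the worked example expects; adding a guessed second parent to HalfSib would change that result.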

Always wire estate-plan / fiduciary-role parties to the household (visibility-only)

When an estate-plan document — a will, revocable/irrevocable trust, durable POA, advance healthcare directive (AHCD), HIPAA authorization, guardian nomination, beneficiary designation form, or trustee certification — references an individual in a fiduciary or designee role, that person MUST be created AND wired to the household, even if they have no economic ownership.

Common roles seen in trust/estate documents (all of these trigger the rule):

HEALTHCARE_AGENT (primary or alternate, per AHCD)
GUARDIAN (of person or estate of a minor — incl. nomination committee members)
BENEFICIARY / SPECIFIC_GIFT_BENEFICIARY (contingent or remainder)
EXECUTOR (of a will)
TRUSTEE / SUCCESSOR_TRUSTEE / SUCCESSOR_SPECIAL_TRUSTEE
TRUST_PROTECTOR
POWER_OF_ATTORNEY (financial agent — could be primary or springing)
GRANTOR (when not the household principal)
INSURED (of a policy) — when a non-household-principal individual is insured

Why: These individuals get extracted during Phase 3 (the document references them by name with a fiduciary role), but they are NOT economic owners of any account, asset, or LE — so the skill's default OWNERSHIP-edge logic skips them and they end up as parentHouseholdId=null orphans. Verita 2026-05-04 cleanup found 4 such orphans in the Comolli household alone (Bret Comolli, Marney Jurey, Toby Broke-Smith, Jennifer Connolly — all named as healthcare agents or guardian-committee members in Hannah's Feb 2026 estate plan but never wired to the HH).

The rule: For every individual extracted from an estate-plan or trust document, create a Household→Individual visibility-only OWNERSHIP edge IN ADDITION to any role-specific edges from the document:

{
  "relationshipType": "OWNERSHIP",
  "sourceEntityType": "HOUSEHOLD",
  "sourceEntityId": "<household-id>",
  "targetEntityType": "INDIVIDUAL",
  "targetEntityId": "<individual-id>",
  "percentage": 100,
  "economicOwnership": false,
  "role": "Estate-plan party (visibility-only, see role-specific edges)"
}

economicOwnership: false makes the edge invisible to the 100%-sum validator on ownership totals — it's purely a graph-membership signal so the individual rolls up under the household for UI display, family-tree traversal, and search results.

Then ALSO create the role-specific edges per the source document:

// Toby Broke-Smith is Hannah's primary healthcare agent
{ "relationshipType": "HEALTHCARE_AGENT", "sourceEntityType": "INDIVIDUAL",
  "sourceEntityId": "<toby-id>", "targetEntityType": "INDIVIDUAL",
  "targetEntityId": "<hannah-id>", "isPrimary": true,
  "role": "Primary healthcare agent per AHCD dated YYYY-MM-DD" }

// Jennifer Connolly is the alternate
{ "relationshipType": "HEALTHCARE_AGENT", "sourceEntityType": "INDIVIDUAL",
  "sourceEntityId": "<jennifer-id>", "targetEntityType": "INDIVIDUAL",
  "targetEntityId": "<hannah-id>", "isPrimary": false,
  "role": "Alternate healthcare agent per same AHCD" }

// All 5 committee members for Celeste's guardian-of-person nomination
{ "relationshipType": "GUARDIAN", "sourceEntityType": "INDIVIDUAL",
  "sourceEntityId": "<committee-member-id>", "targetEntityType": "INDIVIDUAL",
  "targetEntityId": "<minor-child-id>",
  "role": "Guardian-of-person nomination committee member (majority nominates)" }

Edge-cardinality + Ind→Ind type-restriction reminders (validated against live API 2026-05-04):

GUARDIAN: maxCardinality: 1 per target. Only the active primary guardian gets a GUARDIAN edge. Backup-committee members CANNOT use RELATED_PARTY or ASSOCIATED_WITH Ind→Ind — the backend rejects both with HTTP 400 "Individual cannot have Related Party / Associated With relationship with Individual". Capture committee membership instead in:
1. The primary guardian's notes field (free text describing the committee), OR
2. A document-level note on the will/estate plan PDF, OR
3. The minor child's individual description / notes field (e.g., "Guardian-of-person nomination committee: Marney Jurey, Toby Broke-Smith, Carter Comolli, Bret Comolli, Jennifer Connolly"). Don't try to force per-member edges — the backend will reject every retry.
HEALTHCARE_AGENT: Ind→Ind is allowed. No max-cardinality. Use isPrimary: true on exactly one and isPrimary: false on alternates.
EXECUTOR: see cardinality rules in the API spec; typically primary + alternates.
BENEFICIARY: Ind→LE is allowed (e.g., contingent remainder beneficiary of a trust); Ind→Account/Insurance/etc. is allowed. Use isPrimary: true/false to distinguish primary vs contingent.
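Because the backend rejects over-cardinality edges, it can help to check a planned batch before POSTing. The sketch below covers only the limits stated in this document (GUARDIAN max 1 per target, PARENT max 2 per target, one primary HEALTHCARE_AGENT); `cardinality_violations` is a hypothetical helper and other relationship types are deliberately left unchecked.

```python
from collections import Counter

# Per-target limits stated in this document; other types are not checked here.
MAX_PER_TARGET = {"GUARDIAN": 1, "PARENT": 2}


def cardinality_violations(edges):
    """Return human-readable problems found in a batch of planned edges."""
    counts = Counter(
        (e["relationshipType"], e["targetEntityId"])
        for e in edges
        if e["relationshipType"] in MAX_PER_TARGET
    )
    problems = [
        f"{rel} on {target}: {n} edges, max {MAX_PER_TARGET[rel]}"
        for (rel, target), n in counts.items()
        if n > MAX_PER_TARGET[rel]
    ]
    # HEALTHCARE_AGENT has no cardinality cap, but only one may be primary.
    primaries = Counter(
        e["targetEntityId"]
        for e in edges
        if e["relationshipType"] == "HEALTHCARE_AGENT" and e.get("isPrimary")
    )
    problems += [
        f"HEALTHCARE_AGENT on {target}: {n} primary agents, expected 1"
        for target, n in primaries.items()
        if n > 1
    ]
    return problems
```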

Soft-delete uniqueness gotcha (known platform bug): If an HH→Individual OWNERSHIP edge fails with HTTP 409 "A record with this information already exists" but no live edge is visible via /individual/{id}/relationships/to, that's the documented uniqueness constraint including soft-deleted records (open backend ticket as of 2026-05-04). The HH→Individual visibility edge cannot be re-created until engineering DB-direct-deletes the soft-deleted ghost. Workaround: rely on the role-specific edges (HEALTHCARE_AGENT, BENEFICIARY, etc.) for graph connection — the individual will appear in the household's relationships graph even though parentHouseholdId won't auto-derive without the OWNERSHIP edge.

Distinguish "household member" from "estate-plan party":

A primary household individual (spouse, child, parent of household principal) → HH→Ind OWNERSHIP at 100% with economicOwnership: true. They are HH members.
A fiduciary designee or non-resident relative referenced only in estate planning → HH→Ind OWNERSHIP at 100% with economicOwnership: false. They are visibility-only graph members so the household's estate-plan flowchart resolves correctly.

When NOT to apply this rule: If the same person already has an economicOwnership: true HH membership edge (i.e., they are a primary household member who also serves as healthcare agent), do NOT add a second visibility edge — just add the role-specific HEALTHCARE_AGENT edge alone. One person, one HH-membership edge.
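The member-versus-party distinction and the "one person, one HH-membership edge" rule can be combined in a single builder. The following is a sketch with hypothetical names (`household_membership_edge`); the "Household member" role label is an assumption, while the visibility-only body matches the example given earlier in this section.

```python
def household_membership_edge(household_id, individual_id,
                              is_household_member, existing_edges):
    """Return the HH->Individual OWNERSHIP body, or None if one already exists."""
    for edge in existing_edges:
        if (edge.get("relationshipType") == "OWNERSHIP"
                and edge.get("sourceEntityId") == household_id
                and edge.get("targetEntityId") == individual_id):
            return None  # one person, one HH-membership edge
    role = ("Household member" if is_household_member  # assumed label
            else "Estate-plan party (visibility-only, see role-specific edges)")
    return {
        "relationshipType": "OWNERSHIP",
        "sourceEntityType": "HOUSEHOLD",
        "sourceEntityId": household_id,
        "targetEntityType": "INDIVIDUAL",
        "targetEntityId": individual_id,
        "percentage": 100,
        # True for primary household members, False for visibility-only parties.
        "economicOwnership": is_household_member,
        "role": role,
    }
```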

Keep flat fields in sync with the OWNERSHIP graph (especially ILIT flags)

Several entity fields are duplicates of information already expressed in the relationship graph. Today the platform does NOT auto-derive these flat fields from edges (planned backend fix per PLT-88), so the skill must keep them consistent at creation time. Drift silently misrepresents the data; the most impactful case is isIlitOwned on InsurancePolicy because it changes estate-tax modeling.

Rule for life-insurance policies: when the policy's owner is an ILIT (Irrevocable Life Insurance Trust), set BOTH:

{
  "isIlitOwned": true,
  "ilitLegalEntityId": "<le-id-of-the-ILIT>"
}

…on the InsurancePolicy POST/PATCH body, AND create the OWNERSHIP edge from the ILIT to the policy.

ILIT detection criteria (a LegalEntity is an ILIT when ALL hold):

entityType: TRUST
Either trust.isIrrevocable: true (canonical signal), OR the trust name matches the case-insensitive regex \b(ILIT|irrevocable|insurance trust|life insurance trust)\b
The name does NOT match \brevocable trust\b as a whole token — beware that "revocable" is a substring of "irrevocable" and a naive contains-check will false-match. Use word-boundary regex: (?:^|[^a-z])revocable\s+trust.
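The three criteria can be combined into one predicate. This is a minimal sketch using the two regexes quoted above; `is_ilit` and its parameters are hypothetical names, and `is_irrevocable` stands in for the `trust.isIrrevocable` value (None when unknown).

```python
import re

# Name pattern and exclusion pattern as quoted in the criteria above.
ILIT_NAME = re.compile(
    r"\b(ILIT|irrevocable|insurance trust|life insurance trust)\b",
    re.IGNORECASE,
)
# "revocable trust" must not be preceded by a letter, so the "revocable"
# inside "irrevocable" never triggers the exclusion.
REVOCABLE_TRUST = re.compile(r"(?:^|[^a-z])revocable\s+trust", re.IGNORECASE)


def is_ilit(entity_type, name, is_irrevocable=None):
    """Apply the three ILIT criteria; all must hold."""
    if entity_type != "TRUST":
        return False
    if REVOCABLE_TRUST.search(name):
        return False
    return is_irrevocable is True or bool(ILIT_NAME.search(name))
```

Note that a name like "John Smith Revocable Living Trust" is not caught by the exclusion regex (a word sits between "Revocable" and "Trust"); it is rejected only because it fails the name match and has no irrevocable flag.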

Common ILIT name patterns (from Verita 2026-05-05 cleanup):

<Name> Family <Year> Irrevocable Trust (e.g., "AL Family 2020 Irrevocable Trust")
<Insured Name> Irrevocable Trust (e.g., "Kevin Dawdy Irrevocable Trust")
The <Insured> Irrevocable Trust (e.g., "The Roger A McIntosh Irrevocable Trust")
<Name> ILIT (literal acronym usage)
<Name> Insurance Trust / <Name> Life Insurance Trust

The reverse case (don't set the flag wrongly): if the policy is owned by an Individual or by a revocable trust (e.g., "John Smith Revocable Living Trust"), isIlitOwned MUST be false and ilitLegalEntityId must be null. A revocable trust is part of the grantor's estate for tax purposes, so it does NOT confer ILIT treatment.

Why this matters: the whole point of an ILIT is to keep death benefit outside the insured's taxable estate. A misflagged policy:

Distorts UI views ("Estate-Tax Exposure" panel reads from the flat field)
Distorts MCP responses (get_insurance_info exposes the flag to LLM consumers)
Causes wrong estate-tax projections — advisor may over-buy ILIT-replacement coverage thinking the policy is in-estate when it's actually out, or under-prepare for tax when the platform shows ILIT-owned but it's actually in-estate.

Same pattern applies to other flat-field-vs-graph duplicates (audit and keep in sync at creation):

InsurancePolicy.isInsuredByOwner (when insured = a person who's also one of the owners) — set true, otherwise false/null
TangibleAsset.isInsured — set true if there's an active OWNERSHIP edge from an InsurancePolicy or if primaryInsurancePolicyId is populated
TangibleAsset.totalLiabilityBalance — should reflect sum of LIABILITY records linked to this asset; do not invent a number, leave null if no liabilities are linked
Trust subtype booleans on LegalEntity (isRevocable, isGrantor, isIntentionallyDefective) — set from the trust agreement document; if not determinable, leave null and add open-question

When in doubt about whether a flat field is auto-derived or skill-maintained, check by querying a recently-created entity: does the field match the graph state? If it matches, the platform derives it; if not, the skill must set it.

Trust agreement PDFs require structured extraction (with OCR fallback)

Trust agreement PDFs are the canonical source for LegalEntity.trust.* fields (isRevocable, isGrantor, governingLaw, situs, hasPourOverProvision, hasSpendthriftProvision, isRestatement, etc.). They are MANDATORY reads when present.

Phase-3 extraction must capture these fields from any document with documentSubType in the trust-document family:

TRUST_AGREEMENT, TRUST_CERTIFICATION, TRUST_AMENDMENTS, REVOCABLE_TRUST_DOCUMENT,
IRREVOCABLE_TRUST_DOCUMENT, TRUSTEE_CERTIFICATION, RESTATEMENT

Critical lesson — restatements ARE trust agreements: A document titled "Second Amendment and Restatement of the X Trust" or "Second Restatement of the Y Trust" contains the FULL current trust agreement. These are often classified as TRUST_AMENDMENTS but should be treated as TRUST_AGREEMENT for extraction purposes — the restatement supersedes the original. Verita 2026-05-05 audit found 4 trusts where the restatement was the only agreement-content doc and was classified as TRUST_AMENDMENTS or OTHER, causing the skill to skip it.

Reclassification pass: During Phase 2 doc scan, if you find a doc with documentSubType=OTHER or documentSubType=TRUST_AMENDMENTS whose filename contains (Restatement|Amendment.*Restatement|Trust Agreement|Trust Agt), RECLASSIFY it to TRUST_AGREEMENT via PATCH /document/{id}/metadata before Phase 3 extraction.
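The reclassification test is a filename match gated on the current subtype. The sketch below uses the filename pattern quoted above; making it case-insensitive is an assumption, and `should_reclassify` is a hypothetical helper that only decides whether the PATCH is needed.

```python
import re

# Filename pattern from the rule above; case-insensitivity is an assumption.
AGREEMENT_FILENAME = re.compile(
    r"(Restatement|Amendment.*Restatement|Trust Agreement|Trust Agt)",
    re.IGNORECASE,
)
RECLASSIFIABLE_SUBTYPES = {"OTHER", "TRUST_AMENDMENTS"}


def should_reclassify(document_sub_type, filename):
    """True when a doc should be reclassified to TRUST_AGREEMENT before Phase 3."""
    return (document_sub_type in RECLASSIFIABLE_SUBTYPES
            and bool(AGREEMENT_FILENAME.search(filename)))
```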

Required extraction fields per trust agreement

Field	How to detect
`isRevocable`	Title declares `IRREVOCABLE TRUST` → false. Body has "this trust is irrevocable" → false. Body has `we reserved the right to amend/revoke` / `Revocable Trust` declared in body → true. Document title literally `<Name> Revocable Trust` → true (sanity-check against body).
`isGrantor`	Body references "Grantor Trust Provisions" article, IRC §671/§672/§673/§674/§675/§676/§677/§678, or "intentionally defective grantor trust" → true. Otherwise null.
`governingLaw`	Search for `governed by the laws of the State of <X>` / `construed in accordance with the laws of <X>` / notary block `STATE OF <X>` (which often shows the situs state if no explicit clause).
`situs`	Look for `situs of this trust shall be <X>` / `principal place of administration shall be in <X>`. Falls back to `governingLaw` for many trusts.
`hasPourOverProvision`	Body mentions `pour-over will/provision/trust` → true.
`hasSpendthriftProvision`	Body mentions `spendthrift provision/clause/trust` → true.
`isRestatement`	Title or first page contains `Amendment and Restatement` / `Restatement of` → true.
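A few of the detection rules in the table lend themselves to simple text matching. The sketch below is a rough first pass under stated assumptions: `extract_trust_flags` is a hypothetical helper, `title` stands for the title or first-page text, the regexes are approximations of the quoted phrases, and unmatched fields stay None so that nothing is invented. `governingLaw` and `situs` are omitted because they need a reviewer to confirm the captured state.

```python
import re


def _true_or_none(pattern, text):
    """True when the pattern is found, otherwise None (never False)."""
    return True if re.search(pattern, text, re.IGNORECASE) else None


def extract_trust_flags(title, body):
    """Heuristic first pass over trust-agreement text; None means 'not found'."""
    if (re.search(r"\birrevocable trust\b", title, re.IGNORECASE)
            or re.search(r"this trust is irrevocable", body, re.IGNORECASE)):
        is_revocable = False
    elif (re.search(r"\brevocable trust\b", title, re.IGNORECASE)
            or re.search(r"reserved? the right to (amend|revoke)", body,
                         re.IGNORECASE)):
        is_revocable = True
    else:
        is_revocable = None
    return {
        "isRevocable": is_revocable,
        "isGrantor": _true_or_none(
            r"grantor trust provisions|intentionally defective grantor trust"
            r"|§\s*67[1-8]", body),
        "hasPourOverProvision": _true_or_none(
            r"pour-?\s?over (will|provision|trust)", body),
        "hasSpendthriftProvision": _true_or_none(
            r"spendthrift (provision|clause|trust)", body),
        "isRestatement": _true_or_none(
            r"amendment and restatement|restatement of", title),
    }
```

Results from a pass like this still need the sanity check against the body that the table calls for, especially for `isRevocable`.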

When the PDF can't be text-extracted (font-encoding or scanned)

Many trust agreements use embedded fonts with custom CID maps (common from Kirkland & Ellis, Loeb & Loeb, etc.), so pdftotext returns garbled output or zero characters. Fall back to OCR via tesseract:

# Convert pages 1-12 to PNG (most trust provisions land in first 12 pages).
# Write the PNGs into the same directory the OCR loop runs in.
mkdir -p /tmp/trust_ocr
pdftoppm -png -r 200 -f 1 -l 12 trust.pdf /tmp/trust_ocr/trust_page

# Run tesseract on each page (use relative paths: tesseract has macOS quirks with
# absolute paths under sandbox)
cd /tmp/trust_ocr
for p in trust_page-*.png; do
    tesseract "$p" "${p%.png}" -l eng --psm 1
done

OCR'd pages 1-5 typically give title + table of contents (not enough to determine revocable/irrevocable). Pages 6-12 typically include the substantive provisions where the revocable/irrevocable declaration appears. Always OCR at least pages 1-12, longer if needed.
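
Deciding when to fall back to OCR can be automated with a rough check on the `pdftotext` output. The sketch below is one possible heuristic, not something the skill prescribes: `needs_ocr` is a hypothetical helper, and both thresholds (`min_chars`, `min_word_ratio`) are assumed values that would need tuning on real documents.

```python
import re


def needs_ocr(extracted: str, min_chars: int = 200, min_word_ratio: float = 0.5) -> bool:
    """Guess whether pdftotext output is empty or garbled.

    ASSUMPTION: text is 'garbled' when fewer than half of its
    whitespace-separated tokens look like ordinary English words.
    """
    stripped = extracted.strip()
    if len(stripped) < min_chars:
        return True  # zero or near-zero characters: scanned or unreadable PDF
    tokens = stripped.split()
    wordlike = [
        t for t in tokens
        if re.fullmatch(r"[A-Za-z][A-Za-z'\-]*[.,;:)]?", t)
    ]
    return len(wordlike) / len(tokens) < min_word_ratio
```

When the function returns `True`, run the `pdftoppm` and `tesseract` steps above on at least pages 1-12.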

When the trust agreement is NOT in Discovery

If a trust LE exists in Altitude (often from CRM/balance-sheet onboarding) but no trust agreement document is in the family's Discovery folder:

Do NOT invent values for any trust subtype field

Add an entry to altitude_review/open_questions.json:

{
  "category": "missing_data", "blocking": false,
  "entity": "<Trust Name>", "entityId": "<le_id>",
  "question": "Trust agreement document for '<Trust Name>' not in Discovery — request from family attorney to populate isRevocable / isGrantor / governingLaw / situs / GST exemption status",
  "rationale": "Trust LE was created without a source document. Trust subtype fields will remain null until agreement is uploaded."
}

Once RM uploads the missing agreement, re-run the skill to fill the fields.
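
Recording the open question can be sketched as a small pure function. This is an illustration under assumptions: `with_missing_agreement_question` is a hypothetical name, `open_questions.json` is assumed to hold a JSON array of such entries, and skipping a duplicate entry for the same `entityId` is a design choice of this sketch rather than a stated rule.

```python
from typing import Dict, List


def with_missing_agreement_question(
    questions: List[Dict], trust_name: str, le_id: str
) -> List[Dict]:
    """Return a new list with one non-blocking missing_data entry for the trust."""
    # ASSUMPTION: one missing_data question per entity is enough, so re-runs
    # do not pile up identical entries.
    already_asked = any(
        q.get("category") == "missing_data" and q.get("entityId") == le_id
        for q in questions
    )
    if already_asked:
        return list(questions)
    entry = {
        "category": "missing_data",
        "blocking": False,
        "entity": trust_name,
        "entityId": le_id,
        "question": (
            f"Trust agreement document for '{trust_name}' not in Discovery; "
            "request from family attorney to populate isRevocable / isGrantor / "
            "governingLaw / situs / GST exemption status"
        ),
        "rationale": (
            "Trust LE was created without a source document. Trust subtype "
            "fields will remain null until agreement is uploaded."
        ),
    }
    return list(questions) + [entry]
```

The caller would load the existing array from altitude_review/open_questions.json with `json.loads`, pass it through this function, and write the result back. Note that the entry carries no trust subtype values at all.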

Vendor firms and institutions are Contacts, NOT LegalEntities

Rule: A LegalEntity in Altitude represents an entity the household has an ownership, beneficial, fiduciary (as grantor/trustee/beneficiary), or membership interest in — trusts they created, LLCs they own, partnerships they're a partner in, corporations they hold shares of. Everything else is a Contact, even if it is technically a corporation in the real world.

Do NOT create a LegalEntity for:

Corporate trustees / executor firms providing fiduciary service (e.g. fiduciary trust companies, professional executor services). Create the company as a Contact with jobTitle: "Corporate Trustee" or biography: "<company name>"; use an individual officer's Contact if a specific person is named in the trust/will.
Schools / universities the client or their children attend. Create as a Contact with jobTitle: "School" or similar, or just note in Individual's supplemental attributes.
Custodian banks (Schwab, Fidelity, Merrill, Wells Fargo, etc.) — these are modeled separately as Custodian entities on accounts. Never a LegalEntity.
Law firms, accounting firms, advisory firms whose individual professionals we've already created as Contacts. The firm name lives on the Contact's biography.
Investment fund vehicles the client indirectly holds via a carry/side-fund interest (see "Fund-entity flood" — aggregate at the parent trust, don't create per-fund entities).
Government agencies, courts, tax authorities, registrars mentioned in documents.
Vendors (property management companies, insurance brokerages, auction houses, galleries, etc.) — Contact with biography identifying the firm.

DO create a LegalEntity for:

Trusts the household is the grantor/beneficiary/trustee of
LLCs / LPs / partnerships the household owns or is a member/partner of
Corporations the household holds shares in (if material and tracked at entity level)
DAFs and private foundations the household funded
Operating companies the household controls
Holdco entities in the household's ownership chain

If in doubt, ask: "Does the household have an ownership, fiduciary, or beneficial interest in this entity, OR is it just providing a service?" Services → Contact. Interest → LegalEntity.

Example miss: a recent onboarding created "Trust Company X" (a corporate trustee providing fiduciary service) as a LegalEntity with EXECUTOR and SUCCESSOR_TRUSTEE relationships pointing to it. Correct model: Trust Company X is a Contact (the individual officer or the company with jobTitle: "Corporate Trustee"), and the EXECUTOR / SUCCESSOR_TRUSTEE relationships originate from the Contact using the CONTACT→INDIVIDUAL / CONTACT→LEGAL_ENTITY validator rules.

Firm users are NOT Contacts — check before creating

Advisors, analysts, COOs, client-service staff, and any other employee of the firm that owns this household are already system Users and will be attached to the household via its FirmTeam membership (separate admin flow). Do NOT create them as per-household Contacts.

Mandatory precheck — before POSTing ANY Contact:

# Get the firm's users once at the start of the run and cache them
curl -s "${BASE}/user?firmId=${FIRM_ID}&size=200" -H "X-API-Key: ${API_KEY}" \
  | jq -r '(.content // .)[] | "\(.email // .login)\t\(.firstName) \(.lastName)"' \
  > altitude_review/firm_users.tsv

Then for every Contact candidate, block creation if ANY of these match:

The candidate's email is in the firm users list (exact match)
The candidate's email domain matches the firm's domain (e.g. @<firm-domain>.com, @m62.ai)
The candidate's full name (first+last, case-insensitive) matches a firm user

If matched, record in altitude_review/firm_users_skipped.md (name + why) and skip Contact creation entirely. They are NOT the client's relationship — they are the firm serving the client. The FirmTeam admin flow handles attachment to the household.
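
The three blocking checks can be sketched against the cached `firm_users.tsv`. The helper names (`parse_firm_users`, `firm_user_match`) and the `firm_domains` parameter are hypothetical, and lowercasing emails before comparison is an assumption of this sketch. The TSV layout (email, tab, "First Last") follows the `jq` output above.

```python
from typing import List, Optional, Set, Tuple


def parse_firm_users(tsv_text: str) -> List[Tuple[str, str]]:
    """Parse 'email<TAB>First Last' rows into normalized (email, name) pairs."""
    rows = []
    for line in tsv_text.splitlines():
        if "\t" in line:
            email, name = line.split("\t", 1)
            rows.append((email.strip().lower(), " ".join(name.split()).lower()))
    return rows


def firm_user_match(
    email: Optional[str],
    full_name: Optional[str],
    firm_users: List[Tuple[str, str]],
    firm_domains: Set[str],
) -> Optional[str]:
    """Return the reason a Contact candidate must be skipped, or None if allowed."""
    email = (email or "").strip().lower()
    name = " ".join((full_name or "").split()).lower()
    if email and any(email == u_email for u_email, _ in firm_users):
        return "email matches a firm user"
    if "@" in email and email.rsplit("@", 1)[1] in firm_domains:
        return "email domain is the firm's domain"
    if name and any(name == u_name for _, u_name in firm_users):
        return "full name matches a firm user"
    return None
```

A non-`None` result is the "why" to record in altitude_review/firm_users_skipped.md.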

Who SHOULD be a Contact:

External professionals: outside attorneys, outside CPAs, insurance agents at external brokerages, prior-firm advisors (e.g. pre-transition), corporate trustees from other companies (e.g. an independent corporate trustee)
Family members and personal contacts (healthcare agents, successor trustees, guardians, executors who are individuals)
Vendors / service providers (property managers, household staff when recorded as Contacts, marina managers, etc.)

Who should NOT be a Contact (belongs on FirmTeam instead):

The firm's lead advisor, co-advisor, junior advisors, analysts, planners
Firm operations (COO, CTO, compliance officer, head of ops)
Firm client-service team (client-service associates, administrative staff)
Firm interns
Any email ending in the firm's domain

Real example: a recent onboarding run wrongly created 5 firm employees (all matching the firm's email domain) as Contacts + ADVISOR relationships. They belong on the household's FirmTeam, not as per-household Contacts. Always run the precheck above first.

Prerequisites

Required Tools

This skill runs cross-platform (macOS, Linux, Windows). The following tools must be installed and on the user's PATH before running. Verify each at the start of Step 0 with shutil.which(...) and fail fast with a clear message if anything is missing — do NOT attempt to install tooling automatically.

Tool	Why	macOS	Linux	Windows
Python 3.9+	Script runtime for .docx/.xlsx/.eml/large PDFs	`brew install python`	`apt install python3`	`winget install Python.Python.3.12` (avoid the Microsoft Store stub — it silently redirects to a non-functional alias)
pip packages	Document parsing	`pip install pypdf python-docx openpyxl requests`	same	same
qpdf	Decrypt password-protected PDFs	`brew install qpdf`	`apt install qpdf` or `dnf install qpdf`	`winget install qpdf.qpdf` or `choco install qpdf` or `scoop install qpdf`
poppler (pdftotext)	Text-first PDF extraction (REQUIRED, not optional) — the default PDF read strategy uses `pdftotext -layout` before falling back to Claude's Read tool. Avoids the 2000px image-dimension limit that scanned-PDF pages can hit.	`brew install poppler`	`apt install poppler-utils`	`winget install oschwartz10612.Poppler` or `choco install poppler`
tesseract (OCR)	Fallback for scanned PDFs where `pdftotext` returns empty (i.e., pure image PDFs — trust documents, deeds, handwritten notes). Pipe `pdftoppm -r 150` → `tesseract` to get text.	`brew install tesseract`	`apt install tesseract-ocr`	`winget install UB-Mannheim.TesseractOCR` or `choco install tesseract`
pandoc (optional)	Cross-platform .docx → text	`brew install pandoc`	`apt install pandoc`	`winget install JohnMacFarlane.Pandoc`
curl	Occasional API examples (all scripted work uses `requests`)	built-in	built-in	built-in on Windows 10 1803+ (`C:\Windows\System32\curl.exe`)

Verify with this snippet (use PYTHON from Cross-Platform Setup below):

# check_prereqs.py
import shutil, socket, sys
missing = []
# Required tools
for tool in ("qpdf", "pdftotext"):        # pandoc + tesseract are optional-but-recommended
    if not shutil.which(tool):
        missing.append(tool)
# Python packages
try:
    import pypdf, docx, openpyxl, requests  # noqa: F401
except ImportError as e:
    missing.append(f"python package: {e.name}")
# DNS reachability check (fail fast if the user is on a restricted network)
try:
    socket.gethostbyname("api.m62.live")
except socket.gaierror:
    missing.append("DNS: cannot resolve api.m62.live (check network or set up hosts override — see Step 0.c)")
# SSL chain probe (R-W5 amendment, Li-Yang rerun) — macOS Python 3.8
# bundled certifi can fail to validate the api.m62.live cert chain. Probe
# with urllib; on CERT failure, fall back to curl-based recipes (curl
# uses the system keychain) or fix the chain via:
#   pip install certifi && export SSL_CERT_FILE=$(python3 -m certifi)
import ssl, urllib.error, urllib.request
try:
    urllib.request.urlopen("https://api.m62.live/api/v1/health", timeout=10)
except ssl.SSLError as e:
    missing.append(
        f"SSL: cert chain validation failed ({e}). "
        "On macOS Python 3.8, run "
        "'pip install certifi && export SSL_CERT_FILE=$(python3 -m certifi)' "
        "OR fall back to the curl-based recipes documented below "
        "(curl uses the system keychain and is unaffected)."
    )
except urllib.error.HTTPError:
    pass  # non-2xx is fine here; we're only probing the SSL handshake
except urllib.error.URLError as e:
    if "CERTIFICATE" in str(e).upper() or "SSL" in str(e).upper():
        missing.append(f"SSL: {e}")
if missing:
    sys.exit(f"Missing prerequisites: {', '.join(missing)}")
print("All prerequisites OK")

SSL fallback recipes (R-W5 amendment, Li-Yang rerun). If the SSL probe fails on macOS Python 3.8 with a CERTIFICATE_VERIFY_FAILED error and reinstalling certifi does not resolve it, the agent SHOULD use curl for all API calls instead of urllib/requests. curl reads the system keychain on macOS and is not affected by the bundled-certifi issue. Replace any recipe of the form requests.get(url, headers={...}) with the equivalent subprocess.run(["curl", "-s", "-H", f"Authorization: Bearer {jwt}", url], ...). This is a workaround, not a permanent fix — Python ≥3.10 with a fresh certifi install resolves the underlying issue.
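
A small wrapper keeps the curl fallback in one place. This is a sketch, not part of the skill: `curl_argv` and `curl_get_json` are hypothetical names, and adding `--fail` (so that HTTP errors produce a non-zero exit code) is a choice made here rather than something the recipe above specifies.

```python
import json
import subprocess
from typing import Dict, List


def curl_argv(url: str, headers: Dict[str, str]) -> List[str]:
    """Build the curl command line equivalent to requests.get(url, headers=...)."""
    argv = ["curl", "-s", "--fail"]
    for name, value in headers.items():
        argv += ["-H", f"{name}: {value}"]
    return argv + [url]


def curl_get_json(url: str, headers: Dict[str, str]) -> dict:
    """Run curl and decode the JSON body; raises CalledProcessError on failure."""
    done = subprocess.run(
        curl_argv(url, headers), check=True, capture_output=True, text=True
    )
    return json.loads(done.stdout)
```

Passing the arguments as a list avoids shell quoting problems on every platform, which matters for tokens and URLs containing special characters.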

Windows-Specific Notes

Python alias trap: Windows 10+ ships a python.exe stub that opens the Microsoft Store instead of running Python. Verify with python --version. If it opens the Store, disable the alias under Settings → Apps → Advanced app settings → App execution aliases and install real Python from python.org or winget.
Long paths (MAX_PATH 260): household folders with deep nesting can exceed Windows' legacy 260-character path limit. Either enable long paths (reg add HKLM\SYSTEM\CurrentControlSet\Control\FileSystem /v LongPathsEnabled /t REG_DWORD /d 1 /f as admin, then reboot) or place household folders at a short root like C:\cl\ instead of the default Documents tree.
File paths in prompts: when passing file paths to sub-agents, use forward slashes or raw strings in Python (r"C:\cl\Smith" or "C:/cl/Smith"). Mixing backslashes with regular strings causes \n, \t, \r escapes to fire unexpectedly.
PowerShell execution policy: running .ps1 scripts may be blocked by the default Restricted policy. For the refresh scripts, either run with powershell -ExecutionPolicy Bypass -File tools\refresh-api-spec.ps1 or set the policy once with Set-ExecutionPolicy -Scope CurrentUser RemoteSigned.
No bash-isms: Do not write &&, ||, $(...) command substitution, ${VAR} expansion, single-quote heredocs, or python -c "..." with embedded newlines. Always write scripts to a .py file and run them with python script.py.
Line endings: Python handles CRLF/LF transparently. If you write a .py helper script on Windows, don't worry about line endings.
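
The file-path pitfall from the list above is easy to reproduce. The path below is a made-up example chosen because it contains `\t` and `\n`; `same_windows_path` is a hypothetical helper used only for the comparison.

```python
from pathlib import PureWindowsPath

# In a regular string, \t and \n are escape sequences, so this is NOT the
# path it appears to be: it contains a tab and a newline.
broken = "C:\temp\new_client"
raw = r"C:\temp\new_client"       # raw string keeps the backslashes
forward = "C:/temp/new_client"    # forward slashes are also valid on Windows


def same_windows_path(a: str, b: str) -> bool:
    """Compare two strings as Windows paths (separator-insensitive)."""
    return PureWindowsPath(a) == PureWindowsPath(b)
```

Here `same_windows_path(raw, forward)` holds while `same_windows_path(broken, raw)` does not, which is why raw strings or forward slashes are the safe choices when passing paths to sub-agents.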

Step 0: Load Saved Configuration + Authenticate

Do this FIRST before anything else.

Altitude implements a full OAuth 2.1 + PKCE + Dynamic Client Registration authorization server — the exact same protocol Claude uses for its MCP/connector integrations (RFC 8414 / RFC 7591 / RFC 9728 / RFC 7636). This is the preferred interactive auth mode for the skill: the user signs in on Altitude's own hosted login page in a browser, approves the client, and the skill receives a JWT access token via a local loopback callback.

The access token returned by OAuth is a standard Altitude JWT — it works for every REST endpoint (/api/v1/individual, /api/v1/household, /api/v1/document, etc.), not just MCP endpoints, despite the mcp:read/mcp:write scope names.

Auth modes supported:

Mode	Header used on every request	When to use
OAuth (browser)	`Authorization: Bearer <access_token>`	Default for interactive use. Altitude-hosted login, optional MFA, refresh tokens.
API Key	`X-API-Key: ak_live_...`	Automation, CI, long-lived server integrations. No browser needed.
JWT (direct)	`Authorization: Bearer <id_token>`	Fallback: user pastes a JWT obtained out-of-band (e.g., from the Altitude UI session).

0.a — Config file schema

{HOME_DIR}/.altitude/config.json (where HOME_DIR is $HOME on macOS/Linux or %USERPROFILE% on Windows). The config supports all three modes via an authMode discriminator:

{
  "authMode": "oauth" | "api_key" | "jwt",
  "baseUrl": "https://api.m62.live",
  "firmName": "Firm A",

  "apiKey": "ak_live_xxxxxxxx",                    // if authMode=api_key
  "jwt": "eyJhbGciOiJIUzUxMi...",                  // if authMode=jwt (manual paste)

  // if authMode=oauth — populated by the OAuth flow below:
  "oauth": {
    "clientId": "{firm-uuid}",
    "accessToken": "eyJhbGciOiJIUzUxMi...",
    "refreshToken": "k8f3...",
    "tokenType": "Bearer",
    "expiresAt": "2026-04-18T18:00:00Z",
    "scope": "mcp:read mcp:write",
    "email": "advisor@firm.com"                    // cached only for display
  }
}

Security rules (enforce strictly):

NEVER write the password to disk. OAuth is specifically designed so the skill never sees the password — the browser handles that directly with Altitude.
Keep the config file chmod 600 on Unix; on Windows, NTFS per-user ACLs under %USERPROFILE% provide equivalent protection.
When the accessToken is within 5 minutes of expiry, silently refresh via POST /oauth/token with grant_type=refresh_token. If the refresh fails (revoked, expired), fall back to the full browser auth flow.

0.b — If config exists and credentials are current

authMode=api_key + apiKey set → smoke-test with GET /api/v1/authenticate → use
authMode=oauth + accessToken not expired → use immediately
authMode=oauth + accessToken expired but refreshToken valid → refresh silently
Any other state → run the appropriate auth flow below
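
The decision list above can be expressed as one function. This is a sketch with assumed names (`next_auth_step` and its return labels are hypothetical); it compares against a naive UTC `now`, matching how `expiresAt` is written by the login script. Whether a refresh token is still valid cannot be known locally, so "refresh" means "try it, and run the full auth flow if it fails".

```python
import datetime


def next_auth_step(cfg: dict, now: datetime.datetime) -> str:
    """Map a loaded config to 'smoke_test', 'use', 'refresh', or 'run_auth_flow'."""
    mode = cfg.get("authMode")
    if mode == "api_key" and cfg.get("apiKey"):
        return "smoke_test"          # GET /api/v1/authenticate, then use
    if mode == "oauth":
        oauth = cfg.get("oauth", {})
        expires_at = oauth.get("expiresAt")
        if oauth.get("accessToken") and expires_at:
            # Stored as naive UTC ISO-8601 with a trailing "Z".
            exp = datetime.datetime.fromisoformat(expires_at.rstrip("Z"))
            if exp > now:
                return "use"
        if oauth.get("refreshToken"):
            return "refresh"
    return "run_auth_flow"           # any other state
```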

0.c — Auth Mode 1: OAuth (browser, recommended for interactive use)

This is the Claude-connector flow. The skill acts as a public OAuth client:

Discover endpoints — GET {baseUrl}/.well-known/oauth-authorization-server returns:

{
  "issuer": "https://api.m62.live",
  "authorization_endpoint": "https://api.m62.live/oauth/authorize",
  "token_endpoint": "https://api.m62.live/oauth/token",
  "registration_endpoint": "https://api.m62.live/oauth/register",
  "scopes_supported": ["mcp:read", "mcp:write"],
  "grant_types_supported": ["authorization_code", "refresh_token"],
  "response_types_supported": ["code"],
  "code_challenge_methods_supported": ["S256"],
  "token_endpoint_auth_methods_supported": ["none"]
}

Cache these endpoints.

Dynamically register the skill as an OAuth client (RFC 7591). This is a one-time operation — after the first successful registration, reuse the clientId from config. POST {registration_endpoint} with JSON:
```
{
  "client_name": "M62 Altitude Onboarding Skill",
  "redirect_uris": ["http://127.0.0.1:<random-free-port>/callback"],
  "grant_types": ["authorization_code", "refresh_token"],
  "token_endpoint_auth_method": "none",
  "response_types": ["code"]
}
```
Allowed redirect URIs are http://localhost, http://127.0.0.1, or https://. The response contains client_id — save it to config.oauth.clientId for reuse.
Start a local loopback HTTP server on 127.0.0.1:<port> to receive the OAuth redirect. Bind port 0 to let the OS pick a free port, then read the assigned port.

Generate PKCE values (RFC 7636, S256 only):

import secrets, hashlib, base64
code_verifier = base64.urlsafe_b64encode(secrets.token_bytes(64)).rstrip(b"=").decode()
code_challenge = base64.urlsafe_b64encode(
    hashlib.sha256(code_verifier.encode()).digest()
).rstrip(b"=").decode()
state = secrets.token_urlsafe(32)  # CSRF protection

Open the browser to the authorization endpoint with query parameters:
```
{authorization_endpoint}
  ?response_type=code
  &client_id={clientId}
  &redirect_uri=http://127.0.0.1:{port}/callback
  &scope=mcp:read%20mcp:write
  &state={state}
  &code_challenge={code_challenge}
  &code_challenge_method=S256
```
The user sees Altitude's own login page (not the skill's UI) in their browser, enters their email + password, and Altitude authenticates them. On success, Altitude redirects to http://127.0.0.1:{port}/callback?code=XXX&state=YYY.
Local server catches the redirect, validates state, captures code, shows the user a "Signed in — you can close this tab" page, then shuts down.

Exchange code for tokens — POST {token_endpoint} as application/x-www-form-urlencoded:

grant_type=authorization_code
&code={captured_code}
&code_verifier={code_verifier}
&redirect_uri=http://127.0.0.1:{port}/callback
&client_id={clientId}

Response (200):

{
  "access_token": "eyJhbGciOiJIUzUxMi...",
  "token_type": "Bearer",
  "expires_in": 3600,
  "refresh_token": "k8f3...",
  "scope": "mcp:read mcp:write"
}

Save accessToken, refreshToken, expiresAt = now + expires_in - 30s (30s buffer), and tokenType to config.oauth.*.

Complete script — write this to a temp file and run it. It handles the whole flow including the loopback server:

# altitude_oauth_login.py
import base64, hashlib, http.server, json, os, pathlib, secrets, socketserver
import sys, threading, urllib.parse, urllib.request, webbrowser

BASE = sys.argv[1]  # e.g., https://api.m62.live

# 1. Discover endpoints
meta_url = f"{BASE}/.well-known/oauth-authorization-server"
meta = json.loads(urllib.request.urlopen(meta_url, timeout=10).read())

# 2. Load or register client
home = pathlib.Path(os.environ.get("USERPROFILE") or os.environ["HOME"])
cfg_path = home / ".altitude" / "config.json"
cfg_path.parent.mkdir(exist_ok=True)
cfg = json.loads(cfg_path.read_text()) if cfg_path.exists() else {}
client_id = cfg.get("oauth", {}).get("clientId")

# 3. Pick a free loopback port
with socketserver.TCPServer(("127.0.0.1", 0), None) as s:
    port = s.server_address[1]
redirect_uri = f"http://127.0.0.1:{port}/callback"

if not client_id:
    reg_body = json.dumps({
        "client_name": "M62 Altitude Onboarding Skill",
        "redirect_uris": [redirect_uri],
        "grant_types": ["authorization_code", "refresh_token"],
        "token_endpoint_auth_method": "none",
        "response_types": ["code"],
    }).encode()
    req = urllib.request.Request(meta["registration_endpoint"], data=reg_body,
                                 headers={"Content-Type": "application/json"})
    reg = json.loads(urllib.request.urlopen(req, timeout=10).read())
    client_id = reg["client_id"]
    print(f"Registered OAuth client: {client_id}")

# 4. PKCE
cv = base64.urlsafe_b64encode(secrets.token_bytes(64)).rstrip(b"=").decode()
cc = base64.urlsafe_b64encode(hashlib.sha256(cv.encode()).digest()).rstrip(b"=").decode()
state = secrets.token_urlsafe(32)

# 5. Loopback server to catch the redirect
result = {}
class Handler(http.server.BaseHTTPRequestHandler):
    def log_message(self, *a): pass  # silence
    def do_GET(self):
        parsed = urllib.parse.urlparse(self.path)
        if parsed.path != "/callback":  # ignore favicon and other stray requests
            self.send_error(404)
            return
        result.update(dict(urllib.parse.parse_qsl(parsed.query)))
        body = b"<html><body><h2>Signed in. You can close this tab.</h2></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
        threading.Thread(target=self.server.shutdown, daemon=True).start()

server = http.server.HTTPServer(("127.0.0.1", port), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# 6. Open the browser
auth_url = meta["authorization_endpoint"] + "?" + urllib.parse.urlencode({
    "response_type": "code", "client_id": client_id, "redirect_uri": redirect_uri,
    "scope": "mcp:read mcp:write", "state": state,
    "code_challenge": cc, "code_challenge_method": "S256",
})
print(f"Opening browser to: {auth_url}")
try: webbrowser.open(auth_url)
except Exception: print("Could not open browser automatically — open the URL manually.")

# Wait for callback (with 5 min timeout)
import time
deadline = time.time() + 300
while not result and time.time() < deadline:
    time.sleep(0.5)
server.shutdown()

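# Check for a timeout first: an empty result would otherwise be misreported
# as a state mismatch.
if not result:
    sys.exit("Timed out waiting for OAuth redirect.")
if "error" in result:
    sys.exit(f"OAuth error: {result.get('error')} ({result.get('error_description', '')})")
if result.get("state") != state:
    sys.exit("OAuth state mismatch (possible CSRF). Aborting.")
if "code" not in result:
    sys.exit("OAuth redirect did not include an authorization code.")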

# 7. Exchange code for tokens
token_body = urllib.parse.urlencode({
    "grant_type": "authorization_code",
    "code": result["code"], "code_verifier": cv,
    "redirect_uri": redirect_uri, "client_id": client_id,
}).encode()
req = urllib.request.Request(meta["token_endpoint"], data=token_body,
    headers={"Content-Type": "application/x-www-form-urlencoded"})
tok = json.loads(urllib.request.urlopen(req, timeout=10).read())

# 8. Save config
import datetime
expires_at = datetime.datetime.utcnow() + datetime.timedelta(seconds=tok["expires_in"] - 30)
cfg["authMode"] = "oauth"
cfg["baseUrl"] = BASE
cfg.setdefault("oauth", {}).update({
    "clientId": client_id,
    "accessToken": tok["access_token"],
    "refreshToken": tok.get("refresh_token"),
    "tokenType": tok.get("token_type", "Bearer"),
    "expiresAt": expires_at.isoformat() + "Z",
    "scope": tok.get("scope"),
})
cfg_path.write_text(json.dumps(cfg, indent=2))
if os.name == "posix":
    os.chmod(cfg_path, 0o600)
print("OAuth login complete. Config saved.")

0.d — OAuth token refresh (automatic)

When the access token is close to expiry, refresh silently:

# altitude_oauth_refresh.py
import json, os, pathlib, urllib.parse, urllib.request, datetime, sys
home = pathlib.Path(os.environ.get("USERPROFILE") or os.environ["HOME"])
cfg = json.loads((home / ".altitude" / "config.json").read_text())
base = cfg["baseUrl"]
oa = cfg["oauth"]
body = urllib.parse.urlencode({
    "grant_type": "refresh_token",
    "refresh_token": oa["refreshToken"],
    "client_id": oa["clientId"],
}).encode()
req = urllib.request.Request(f"{base}/oauth/token", data=body,
    headers={"Content-Type": "application/x-www-form-urlencoded"})
try:
    tok = json.loads(urllib.request.urlopen(req, timeout=10).read())
except urllib.error.HTTPError as e:
    # Refresh failed (token revoked, expired) — caller should re-run full OAuth flow
    sys.exit(f"REFRESH_FAILED:{e.code}")
expires_at = datetime.datetime.utcnow() + datetime.timedelta(seconds=tok["expires_in"] - 30)
oa["accessToken"] = tok["access_token"]
if tok.get("refresh_token"):  # refresh rotation
    oa["refreshToken"] = tok["refresh_token"]
oa["expiresAt"] = expires_at.isoformat() + "Z"
(home / ".altitude" / "config.json").write_text(json.dumps(cfg, indent=2))

0.e — Auth Mode 2: API Key (automation)

User pastes the key (it starts with ak_live_ for production or ak_test_ for dev). Smoke-test with GET {baseUrl}/api/v1/authenticate — 200 means the key is valid. Save with authMode="api_key".

0.f — Auth Mode 3: Direct JWT paste (fallback)

If the user already has a JWT (from the Altitude UI's browser session, for example), they can paste it directly. Save with authMode="jwt" and jwt=<token>. This mode has no refresh capability — when the JWT expires, prompt for a new paste or switch to OAuth.

0.f.5 — DNS reachability test + loopback fallback

Run this test at Step 0 before any API calls. On some networks (corporate DNS, split-horizon, DNS rebinding filters) api.m62.live fails to resolve via the system resolver even though the service is reachable by IP. This has caused 100% of API calls in the skill to fail with connection timeouts in prior runs.

# altitude_dns_probe.py — run first, cache result
import socket, subprocess, json, os, pathlib, sys
def try_system_dns():
    try:
        ip = socket.gethostbyname("api.m62.live")
        return ("system", ip)
    except socket.gaierror:
        return None
def try_public_dns():
    for server in ("1.1.1.1", "8.8.8.8", "9.9.9.9"):
        try:
            out = subprocess.check_output(
                ["dig", f"@{server}", "api.m62.live", "+short", "+time=3"],
                text=True, timeout=5
            ).strip().splitlines()
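            # dig +short can print CNAME targets before the A records; keep IPv4 lines only
            ips = [x for x in out if x.count(".") == 3 and x.replace(".", "").isdigit()]
            if ips: return ("public", ips[0])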
        except Exception: pass
    return None

result = try_system_dns() or try_public_dns()
if not result:
    sys.exit("DNS: cannot resolve api.m62.live via any method. Check network/VPN/firewall.")

method, ip = result
home = pathlib.Path(os.environ.get("USERPROFILE") or os.environ["HOME"])
probe_file = home / ".altitude" / "dns_probe.json"
probe_file.parent.mkdir(exist_ok=True)
probe_file.write_text(json.dumps({"method": method, "ip": ip, "target": "api.m62.live"}))
print(f"DNS OK via {method}: {ip}")

If method == "system" → system DNS works, use normal Python requests or curl.

If method == "public" → system DNS is broken but public DNS has the IP. Every subsequent API call must override. Two patterns:

curl with --resolve (simplest, reliable):

curl --resolve "api.m62.live:443:<IP>" -H "X-API-Key: $KEY" "https://api.m62.live/api/v1/..."

Python requests with connection patching (cleaner for scripts):

# altitude_http.py
import json, os, pathlib, requests
from urllib3.util import connection
home = pathlib.Path(os.environ.get("USERPROFILE") or os.environ["HOME"])
probe = json.loads((home / ".altitude" / "dns_probe.json").read_text())
if probe["method"] == "public":
    _orig = connection.create_connection
    def _patched(addr, *args, **kwargs):
        host, port = addr
        if host == "api.m62.live":
            addr = (probe["ip"], port)
        return _orig(addr, *args, **kwargs)
    connection.create_connection = _patched
# Now use requests normally — DNS patching is transparent

On Windows with curl.exe, the same --resolve flag works. PowerShell's Invoke-WebRequest does not support --resolve; use curl.exe or a Python script via PowerShell instead.

Re-probe every hour (IP can change). Cache the IP in dns_probe.json with a TTL check.
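
The probe script above writes no timestamp into `dns_probe.json`, so one simple way to implement the hourly TTL is to use the cache file's modification time. The sketch below assumes that approach; `is_stale`, `probe_file_is_stale`, and `PROBE_TTL_SECONDS` are hypothetical names.

```python
import pathlib
import time
from typing import Optional

PROBE_TTL_SECONDS = 3600  # "re-probe every hour"


def is_stale(checked_at: Optional[float], now: float,
             ttl: float = PROBE_TTL_SECONDS) -> bool:
    """True when there is no cached probe or it is older than ttl seconds."""
    return checked_at is None or (now - checked_at) > ttl


def probe_file_is_stale(probe_file: pathlib.Path) -> bool:
    """Use the cache file's modification time as the probe timestamp."""
    checked_at = probe_file.stat().st_mtime if probe_file.exists() else None
    return is_stale(checked_at, time.time())
```

When `probe_file_is_stale(...)` returns `True`, re-run `altitude_dns_probe.py` before making further API calls.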

0.g — Pick the right header per request

The skill's helper emits the correct header automatically based on authMode:

# altitude_auth.py — load once, reuse everywhere
import json, os, pathlib, datetime
home = pathlib.Path(os.environ.get("USERPROFILE") or os.environ["HOME"])
cfg = json.loads((home / ".altitude" / "config.json").read_text())

def ensure_fresh():
    """Refresh OAuth token if within 5 minutes of expiry."""
    global cfg  # must be declared before cfg is first used in this function
    if cfg.get("authMode") != "oauth": return
    exp = datetime.datetime.fromisoformat(cfg["oauth"]["expiresAt"].rstrip("Z"))
    if exp - datetime.datetime.utcnow() < datetime.timedelta(minutes=5):
        import subprocess, sys
        subprocess.check_call([sys.executable, "altitude_oauth_refresh.py"])
        # reload the config the refresh script just rewrote
        cfg = json.loads((home / ".altitude" / "config.json").read_text())

def headers():
    ensure_fresh()
    mode = cfg.get("authMode", "api_key")
    if mode == "api_key":
        return {"X-API-Key": cfg["apiKey"]}
    if mode == "oauth":
        return {"Authorization": f"Bearer {cfg['oauth']['accessToken']}"}
    if mode == "jwt":
        return {"Authorization": f"Bearer {cfg['jwt']}"}
    raise RuntimeError(f"Unknown authMode: {mode}")

def base_url(): return cfg["baseUrl"]
def firm_name(): return cfg.get("firmName", "")

0.h — Prompt the user to choose

If no usable config exists, ask:

How should I authenticate to Altitude?

OAuth (browser, recommended) — I'll open Altitude's login page in your browser. You sign in there; I never see your password. I'll cache a short-lived access token + refresh token.

API Key (automation) — you paste an ak_live_... key. Good for CI or long-running integrations where no human is present.

JWT paste (fallback) — paste a JWT obtained from your existing Altitude browser session.

And: "Which environment? Production (https://api.m62.live) or Development (http://localhost:8080)?"

Then run the script for the chosen mode, save config, and proceed.

0.i — Backwards compatibility

Config files without authMode but with apiKey set should be treated as authMode=api_key for transparent upgrade. Write out an updated config with authMode set on the next run.
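
The upgrade rule amounts to a one-field default. A minimal sketch, with `upgrade_config` as a hypothetical helper name:

```python
def upgrade_config(cfg: dict) -> dict:
    """Return a copy with authMode filled in for legacy API-key configs."""
    upgraded = dict(cfg)
    # Legacy files have apiKey but no authMode discriminator.
    if "authMode" not in upgraded and upgraded.get("apiKey"):
        upgraded["authMode"] = "api_key"
    return upgraded
```

Configs that already declare an `authMode` pass through unchanged, and a config with neither field is left for the auth prompt in 0.h to handle.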

Cross-Platform Setup

Detect the operating system and set platform-appropriate defaults. Do this ONCE at the start and reuse throughout:

import platform, shutil, os, tempfile

OS = platform.system()  # "Windows", "Darwin", "Linux"

# Python command
PYTHON = "python" if OS == "Windows" else "python3"

# Temp directory (NEVER hardcode /tmp/)
TMPDIR = tempfile.gettempdir()  # e.g., C:\Users\X\AppData\Local\Temp on Windows, /tmp on Unix

# Word doc converter
if shutil.which("textutil"):
    DOCX_CMD = "textutil -convert txt"          # macOS
elif shutil.which("pandoc"):
    DOCX_CMD = "pandoc -t plain -o"             # Cross-platform
else:
    DOCX_CMD = None  # Fall back to python-docx (see below)

# PDF decryptor
QPDF = shutil.which("qpdf")
# Install if missing:
#   macOS:   brew install qpdf
#   Windows: choco install qpdf  OR  winget install qpdf  OR  scoop install qpdf
#   Linux:   apt install qpdf    OR  dnf install qpdf

Save these values and use them for all subsequent commands. When this skill says python3, use PYTHON. When it says /tmp/, use TMPDIR. When it says textutil, use DOCX_CMD.

Full OpenAPI Spec

The full Altitude OpenAPI specification is available at api-docs/api.json relative to this skill's directory. If you encounter an endpoint or schema not covered in the reference files, search the full spec: Glob pattern "**/m62-altitude-onboarding/**/api.json" then use Grep to find specific endpoints or schema definitions.

Additional Requirements

firmId (UUID) for the target firm — typically discovered during Phase 1 when querying Altitude

Workflow Overview

Phase 0.5: External Sources    → Load firm-CRM / advisor-DB / custodian-API records as authoritative data
Phase 1:   Query Altitude       → Find existing household + its full entity universe
Phase 2:   Scan Documents       → Classify ALL files, create read-tracking checklist
Phase 3:   Extract Entities     → PARALLEL agents read files, write extraction caches
Phase 3M:  Merge Extractions    → Combine all agent caches into unified extraction
Phase 3.5: Cross-Doc Validation → Name enrichment, relationship inference, absence tracking
Phase 3.7: Self-Audit           → Adversarial review: any unread files? any unnamed people? any missing entities?
Phase 4:   Match & Merge        → Match extracted entities to existing ones, diff fields
Phase 5:   Review               → Show user what will change (fills + conflicts)
Phase 6:   Push Updates         → PATCH existing entities, POST new ones (with approval)
Phase 7:   Upload Documents     → Associate each document with its correct entity

Parallel Extraction Strategy

Phase 3 uses parallel sub-agents to avoid context exhaustion. The orchestrator (you) NEVER reads document contents directly. Instead, you spawn extraction agents that each handle a subset of files and write their results to disk.

Batching rules — split by subdirectory, then by count. Hard cap: 25 files per batch.

Group files by subdirectory first. Each top-level folder in the household directory becomes a candidate batch (e.g., Identification/, LLC/, Tax Documents/, Financial Statements/, Insurance/, Estate Planning/).
Cap every batch at 25 files. Any group exceeding 25 gets split:
- 26-50 files → 2 batches of ~13-25
- 51-75 files → 3 batches of ~17-25
- 76+ files → more splits, each within the 25-file cap
Never let a single batch exceed 30 files: sub-agent context pressure becomes severe beyond that, and image-heavy PDFs compound the load.
If a subdirectory has < 4 files, merge it with another small directory into one batch.
Target batch size: 15-20 files is the sweet spot. Very small families (< 10 files total) use 1 batch; larger families get 5-10 batches running in parallel.
Parallelism budget: 5-8 concurrent agents is the default target. 10+ concurrent agents has hit output-size limits in practice; split into waves if needed.
Imbalance is OK — don't force-balance batches. A batch of 16 mixed files + a batch of 22 all-statements is fine. Grouping by document type (all statements together, all trust docs together) is more valuable than perfect file-count parity, because agents can apply type-specific heuristics (statement period parsing, trust role extraction).
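The batching rules above can be sketched as a small planner. This is a minimal illustration, not the skill's actual implementation: the function name `plan_batches`, the `(root)` label for loose files, and the choice to split oversized folders into near-equal chunks are all assumptions.

```python
from collections import defaultdict
from pathlib import PurePosixPath

MAX_BATCH = 25      # hard cap from the rules above
MIN_DIR_FILES = 4   # directories smaller than this get merged together

def plan_batches(rel_paths):
    """Group household-relative file paths into batches by top-level folder."""
    by_dir = defaultdict(list)
    for p in sorted(rel_paths):
        parts = PurePosixPath(p).parts
        top = parts[0] if len(parts) > 1 else "(root)"
        by_dir[top].append(p)

    batches, small = [], []
    for top in sorted(by_dir):
        files = by_dir[top]
        if len(files) < MIN_DIR_FILES:
            small.extend(files)              # merge small directories together
            continue
        # Split into the fewest near-equal chunks that respect MAX_BATCH.
        n_chunks = -(-len(files) // MAX_BATCH)   # ceiling division
        size = -(-len(files) // n_chunks)
        batches.extend(files[i:i + size] for i in range(0, len(files), size))
    for i in range(0, len(small), MAX_BATCH):
        batches.append(small[i:i + MAX_BATCH])
    return batches
```

Splitting by near-equal chunks keeps a 60-file folder at three batches of 20 rather than 25 + 25 + 10, which stays closer to the 15-20 file sweet spot.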

Historical precedent:

Family X (85 files) → 5 batches of 10-21 files, completed in ~9 minutes
Family Y (215 files) → 8 batches of 16-37 files; the 37-file batch strained context and retried once. Cap of 25 would have prevented the retry.

Example batching for a 38-file household:

Batch 1 (Agent A): Identification/ (3 files) + Onboarding/ (1 file) = 4 files
Batch 2 (Agent B): LLC/ (8 files) + Estate Planning/ (4 files) = 12 files
Batch 3 (Agent C): Tax Documents/ files 1-10 = 10 files
Batch 4 (Agent D): Tax Documents/ files 11-14 + Financial Statements/ (5 files) + Insurance/ (3 files) = 12 files

Each extraction agent receives:

Its file list (absolute paths)
The household name and folder path
The extraction field definitions (from this skill's Phase 3 entity fields section)
The document type patterns (from references/document_type_patterns.md)
Instructions to write output to altitude_review/extraction_cache_batch_{N}.jsonl

Each extraction agent produces:

One JSONL file: altitude_review/extraction_cache_batch_{N}.jsonl
One tracker section: altitude_review/file_tracker_batch_{N}.md

After ALL agents complete, the orchestrator:

Reads all extraction_cache_batch_*.jsonl files
Reads all file_tracker_batch_*.md files
Merges into unified extraction_cache.jsonl and file_tracker.md
Verifies 100% file coverage before proceeding to Phase 3.5
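The merge-and-verify steps above might look like the following sketch. It assumes each JSONL line carries a `file` key holding the same path string that appears in the orchestrator's expected file list; `merge_batch_caches` is a hypothetical helper name.

```python
import glob
import json
import os

def merge_batch_caches(review_dir, expected_files):
    """Concatenate per-batch JSONL caches and report files with no cache line."""
    merged, seen = [], set()
    pattern = os.path.join(review_dir, "extraction_cache_batch_*.jsonl")
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue                 # tolerate blank lines between records
                record = json.loads(line)
                merged.append(record)
                seen.add(record["file"])
    out_path = os.path.join(review_dir, "extraction_cache.jsonl")
    with open(out_path, "w", encoding="utf-8") as out:
        for record in merged:
            out.write(json.dumps(record) + "\n")
    missing = sorted(set(expected_files) - seen)
    return merged, missing
```

A non-empty `missing` list means coverage is below 100% and the orchestrator should re-dispatch those files rather than proceed to Phase 3.5.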

Spawn agents using the Agent tool:

Agent(
  prompt="[extraction agent prompt with file list and instructions]",
  description="Extract batch N ({directory_name})",
  mode="bypassPermissions"
)

Launch ALL extraction agents in a single message so they run in parallel. Do NOT launch them sequentially — that defeats the purpose.

Phase 0.5: External-Source Preload (pluggable, run BEFORE Phase 1)

Many firms maintain authoritative client data outside the household document folder: CRM exports, Salesforce / HubSpot databases, internal wealth-platform APIs, custodian data feeds, partner-firm shared drives, or household-spreadsheet templates. Fields in these sources (DOB, SSN, email, phone, billing rates, outside-advisor rosters) are often more complete and more canonical than what lives in discovery-folder documents — they're the firm's system of record.

Loading these BEFORE Phase 1 lets the skill treat their fields as authoritative extraction records (with asOfDate = source timestamp), so Phase 4's latest-date-wins resolution naturally favors them over stale PDFs.

Rule — always check for external sources first

Before starting Phase 1, ask or auto-detect:

Are there any firm-side authoritative data sources for this household that I should load before reading documents? Examples: CRM exports, Salesforce queries, an internal platform API, shared-drive spreadsheets maintained by the advisor team, custodian direct feeds, or a household intake spreadsheet.

If the user has one, connect it. If they don't, proceed directly to Phase 1.

Source adapter interface

External sources are pluggable — never hard-code "CSV" as the only shape. Implement a thin adapter per source type. Each adapter must produce the same shape so downstream phases don't care where the data came from:

# altitude_external_source.py — base protocol
from typing import Protocol, Iterable
from dataclasses import dataclass

@dataclass
class ExternalRecord:
    source_name: str          # "firm-crm", "salesforce", "advisor-platform-api", ...
    record_type: str          # "household" | "individual" | "contact" | "billing" | "entity" | "service_partner" | ...
    household_key: str        # normalized household identifier (name, external id, etc.)
    fields: dict              # flat dict of field -> value
    as_of_date: str           # ISO date — used for latest-date-wins
    provenance: dict          # {"file": "...", "row": 42, "url": "...", "query": "..."} — auditable

class ExternalSource(Protocol):
    def detect(self, household_folder: str, config: dict) -> bool: ...
    def load(self, household_folder: str, config: dict) -> Iterable[ExternalRecord]: ...

Adapter catalog — known source types

Source type	Example	Detection	Auth
File-based CSV export	Firm's `/Partner Share - Altitude/CRM/*.csv`, firm-drive "Client Masters"	Walk up from the household folder looking for `../../CRM/*.csv` or a configured `crm_paths`; hydrate files first (Step 2.0)	Filesystem ACL
File-based spreadsheet intake	Household onboarding worksheet (`Client Information Sheet.xlsx`)	Exists inside the household folder at `Onboarding/*.xlsx` with recognized tab names	Filesystem
Firm internal API	A `GET /crm/households/{name}` against the firm's platform	Config provides `crm_api_base` + `crm_api_key`	Bearer / API key
Salesforce / HubSpot	SOQL/REST query for matching Household account	Config provides OAuth tokens	OAuth
Custodian direct feed	Schwab/Fidelity client master, Addepar client export	Config provides custodian credentials	OAuth / API key
Database (Postgres / BigQuery / etc.)	Firm's internal client DB	Config provides connection string; query by last name	DSN / service account
Another Altitude tenant (partner-firm handoff)	Partner firm's cross-firm household export	Config provides source tenant + API key	API key

Implementation strategy:

Start with what you have. For the firm today that's CSV. Implement FirmCrmCsvSource (4 CSVs: Client_Households_Export, Connections_Export, Leads_Export, Service_Partners_Export).
Make the interface source-agnostic from day one so adding FirmApiSource or SalesforceSource tomorrow doesn't require rewiring Phase 4.
Cache provenance aggressively — ExternalRecord.provenance should let Phase 5 reviewers trace every field back to row 42 of that CSV or GET call XYZ.
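As a starting point for the CSV case, a loader could look roughly like this. It is a sketch under stated assumptions: the header names in `COLUMN_MAP` and the `Household` column are hypothetical (the real export's columns are not specified here), and `load_households_csv` is an illustrative function rather than the `FirmCrmCsvSource` class itself.

```python
import csv
import datetime
import os
from dataclasses import dataclass
from typing import Iterator

@dataclass
class ExternalRecord:  # same fields as the base protocol above
    source_name: str
    record_type: str
    household_key: str
    fields: dict
    as_of_date: str
    provenance: dict

# Hypothetical CSV header -> Altitude field name mapping; adjust to the real export.
COLUMN_MAP = {"First Name": "firstName", "Last Name": "lastName", "Email": "email"}

def load_households_csv(path: str, household_key: str) -> Iterator[ExternalRecord]:
    """Yield one ExternalRecord per row whose 'Household' column matches."""
    # The file's mtime stands in for the export date.
    as_of = datetime.date.fromtimestamp(os.path.getmtime(path)).isoformat()
    with open(path, newline="", encoding="utf-8-sig") as fh:
        for row_num, row in enumerate(csv.DictReader(fh), start=1):
            if (row.get("Household") or "").strip() != household_key:
                continue
            # Blank cells become None, never a placeholder string.
            fields = {dst: (row.get(src) or "").strip() or None
                      for src, dst in COLUMN_MAP.items()}
            yield ExternalRecord(
                source_name="firm-crm", record_type="individual",
                household_key=household_key, fields=fields, as_of_date=as_of,
                provenance={"source": os.path.basename(path), "row": row_num, "path": path},
            )
```

Carrying the row number and path in `provenance` is what lets a Phase 5 reviewer trace a field back to its source row.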

Field-mapping contract

Every adapter outputs normalized field names matching Altitude's DTO vocabulary, NOT the external source's column names. This keeps Phase 4 simple — it doesn't need to know that CRM calls SSN "Tax ID" and Salesforce calls it "tax_identifier". The adapter does that translation.

Example — a firm CRM Client_Households_Export row maps to:

ExternalRecord(
    source_name="firm-crm",
    record_type="individual",
    household_key="FamilyA",
    fields={
        "firstName": "Client", "lastName": "A",
        "dateOfBirth": "1980-01-15",
        "ssn": "000000000",                  # 9 digits, dashes stripped
        "email": "clienta@example.com",
        "phoneNumberPrimary": "5555550100",  # digits only
        "occupation": "Executive",
        "employerName": None,                # CRM says "Self-employed" — map to null, not a string
        "gender": None,                      # blank in CRM, will fill from DL in docs
    },
    as_of_date="YYYY-MM-DD",                 # CRM export date (mtime of source file)
    provenance={"source": "Client_Households_Export.csv", "row": 1,
                "path": "/Partner Share - Altitude/CRM/Client_Households_Export.csv"},
)

Merging external records into the extraction cache

Phase 3's extraction cache (extraction_cache.jsonl) accepts one JSON line per document. External records are conceptually similar — one line per external row or record. Add them to the same cache with a synthetic "file" key so the rest of the pipeline treats them uniformly:

{"file": "[external:firm-crm] Client_Households_Export.csv row 1", "readAt": "2026-04-22T12:00Z", "fileNumber": -1, "asOfDate": "2026-04-22", "entities": {"individuals": [{"name": "Client A", "dob": "1980-01-15", "ssn": "000000000", "_source_kind": "external_crm"}]}}
{"file": "[external:firm-crm] Service_Partners_Export.csv row 3", "readAt": "2026-04-22T12:00Z", "fileNumber": -2, "asOfDate": "2026-04-22", "entities": {"contacts": [{"firstName": "External", "lastName": "Manager", "jobTitle": "Manager", "biography": "Management Firm X", "_source_kind": "external_crm"}]}}

Negative fileNumber values mark external records so Phase 3M can filter/count them separately from document extractions. asOfDate drives latest-date-wins (Rule 40).
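A minimal sketch of how external records could be written into, and later separated from, the cache. The helper names `external_cache_line` and `split_cache` are hypothetical; the key names follow the JSONL examples above.

```python
import json

def external_cache_line(source_name, source_file, row, seq, as_of_date, read_at, entities):
    """Build one extraction-cache line for an external record.

    `seq` counts external records from 1; it is negated so the line is
    distinguishable from document extractions.
    """
    return json.dumps({
        "file": f"[external:{source_name}] {source_file} row {row}",
        "readAt": read_at,
        "fileNumber": -seq,
        "asOfDate": as_of_date,
        "entities": entities,
    })

def split_cache(lines):
    """Separate document lines from external lines by the sign of fileNumber."""
    records = [json.loads(line) for line in lines if line.strip()]
    docs = [r for r in records if r.get("fileNumber", 0) >= 0]
    external = [r for r in records if r.get("fileNumber", 0) < 0]
    return docs, external
```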

Billing, fees, team assignments

CRM / external sources are usually the only place where household-level billing terms and firm-team assignments live — they don't appear in trust agreements or tax returns. Always map these fields if the external source has them:

Household.billing.* (feeStructure, feePercent, minimumFee, frequency, method)
Household.firmTeam (AL Team, CP Team — FirmTeam assignment, NOT Contacts)
Outside professionals roster (Service Partners) — these become Contacts via the same cross-doc dedup rules as document-sourced Contacts

Security constraints

External sources commonly contain unredacted SSN, SIN, DOB, and billing data. Treat them with the same discipline as Rule 9 (sensitive data):

Never log raw SSNs or tax IDs; trace logs may show only the last 4 digits.
Write external-source findings to altitude_review/sensitive_data.json if the source returned anything critical severity.
Memory-only for OAuth tokens / DSNs — the skill's auth helper (altitude_auth) is NOT the right place for firm-DB credentials; use separate .altitude/external_sources/ config, same chmod 600 rules.
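The last-4 logging rule can be enforced with a small masking helper applied before any value reaches a log line. This is an illustrative sketch; `mask_tax_id` is a hypothetical name, not part of the skill's existing helpers.

```python
import re

def mask_tax_id(value):
    """Return a log-safe form showing only the last 4 digits, or None."""
    if value is None:
        return None
    digits = re.sub(r"\D", "", str(value))   # drop dashes, spaces, etc.
    if len(digits) < 4:
        return "*" * len(digits)             # too short to reveal anything safely
    return "*" * (len(digits) - 4) + digits[-4:]
```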

Phase 5 review additions

When external sources contributed, the review must have a dedicated section:

## External-Source Contributions

Loaded before Phase 1:
- `firm-crm` (CSV): 4 files, 13 records (1 household, 1 individual, 5 contacts, 5 connections, 1 billing)
- (if present) `firm-api` (REST): 1 household query, 8 field updates

### Fields contributed by external sources (will be queued as auto-fills or latest-date-wins)

| Entity | Field | External Value | Source | asOfDate |
|--------|-------|---------------|--------|----------|
| Client A (Individual) | dateOfBirth | 1980-01-15 | firm-crm / Client_Households_Export.csv row 1 | YYYY-MM-DD |
| Client A (Individual) | ssn | ***-**-9754 | firm-crm / Client_Households_Export.csv row 1 | 2026-04-22 |
| Family A (Household) | billing.feeStructure | AUM_BASED | firm-crm / Client_Households_Export.csv row 1 | 2026-04-22 |
| External Manager (new Contact) | — | — | firm-crm / Service_Partners_Export.csv row 1 | 2026-04-22 |

This gives the RM a clear audit trail showing which fields came from firm-internal records vs discovery documents.

Phase 1: Query Altitude — Get Existing Household Universe

Before touching any documents, query Altitude to understand what already exists.

API Response Shapes — READ THIS FIRST

Altitude endpoints return two different response shapes. Confusing them leads to silent bugs (e.g. len(resp) == 2 is the dict key count, not the item count).

Endpoint pattern	Shape	Count extraction
`GET /api/v1/{entity}?size=N` (list)	`{"content":[], "page":{"totalElements":N,...}}`	`resp["page"]["totalElements"]`
`GET /api/v1/{entity}/search?searchFor=X`	Paginated wrapper (same)	same
`GET /api/v1/{entity}/by-individual/{id}` / `by-household/{id}` / `by-owner/{type}/{id}`	Paginated wrapper	same
`GET /api/v1/entity-relationship/from/{type}/{id}` / `/to/...`	Bare JSON array `[...]`	`len(resp)`
`GET /api/v1/household/{id}/relationships/from`	Bare JSON array	`len(resp)`
`GET /api/v1/{entity}/{id}` (single)	Bare JSON object	n/a

Write a universal parser once and reuse:

def items(resp):
    if isinstance(resp, list): return resp
    if isinstance(resp, dict) and "content" in resp: return resp["content"]
    return []

def total(resp):
    if isinstance(resp, list): return len(resp)
    if isinstance(resp, dict): return resp.get("page", {}).get("totalElements", len(resp.get("content", [])))
    return 0

Graph-First Discovery Rule

Phase 1 discovery starts from the household and traverses the relationship graph outward. Name-pattern search is a fallback — account names in Altitude are often generic ("Holding", "Custody", "Quantinno") and won't match family-surname searches. Step 1.3 has the traversal algorithm; Step 1.4 is a fallback for orphan accounts.

Step 1.1: Search for the household

GET /api/v1/household/search?searchFor={household_name}&size=50
X-API-Key: {api_key}

or with JWT:

GET /api/v1/household/search?searchFor={household_name}&size=50
Authorization: Bearer {token}

If a matching household is found, record its id. If multiple matches, ask the user which one. If no match, note that this is a new household (will need POST later).
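The three outcomes above (one match, several matches, none) can be sketched as a small resolver. It assumes the search returns the paginated wrapper and that each household object has a `name` field; `resolve_household` is a hypothetical helper, and preferring exact case-insensitive name matches is a design choice of this sketch.

```python
def resolve_household(search_resp, household_name):
    """Return (status, households); status is 'new', 'matched' or 'ambiguous'."""
    content = search_resp.get("content", []) if isinstance(search_resp, dict) else []
    wanted = household_name.strip().casefold()
    exact = [h for h in content if (h.get("name") or "").strip().casefold() == wanted]
    candidates = exact or content            # prefer exact name matches when any exist
    if not candidates:
        return "new", []                     # household will need a POST later
    if len(candidates) == 1:
        return "matched", candidates
    return "ambiguous", candidates           # ask the user which one
```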

Step 1.2: Get the household's full relationship graph

Query outgoing relationships (household → members) and incoming relationships:

GET /api/v1/household/{householdId}/relationships/from
X-API-Key: {api_key}

Or via the standalone entity relationship endpoint:

GET /api/v1/entity-relationship/from/HOUSEHOLD/{householdId}
X-API-Key: {api_key}

This returns all EntityRelationshipDto entries — every individual, legal entity, account, contact, and their relationship types (MEMBER, OWNERSHIP, TRUSTEE, BENEFICIARY, ADVISOR, etc.). Record:

All individual IDs + basic info
All legal entity IDs + entity types
All account (AccountFinancial) IDs + account types
All contact IDs + job titles
Relationship metadata (type, role, percentage, effectiveFrom, effectiveTo)

Step 1.3: Fetch full details for each entity

Account graph traversal — DO NOT trust household.totalAccountCount. In practice the household count often exceeds the number of accounts reachable via direct HOUSEHOLD → ACCOUNT_FINANCIAL relationships, because most accounts hang off trusts and LLCs, not the household itself. In a $1.22B household with 48 accounts, fewer than a dozen were directly owned by the household — the rest were inside trust/LLC sub-graphs.

Traversal algorithm (implement this before moving past Phase 1):

# altitude_account_graph.py — recursively discover all accounts reachable from household
visited_entities = set()   # (entity_type, entity_id) pairs we've already expanded
all_accounts = {}          # account_id -> basic info

def expand(entity_type, entity_id):
    key = (entity_type, entity_id)
    if key in visited_entities: return
    visited_entities.add(key)
    rels = api_get(f"/api/v1/entity-relationship/from/{entity_type}/{entity_id}")
    for r in items(rels):  # items() (defined earlier) unwraps list and paginated shapes
        if r["targetEntityType"] == "ACCOUNT_FINANCIAL":
            all_accounts[r["targetEntityId"]] = {
                "id": r["targetEntityId"],
                "name": r["targetEntityName"],
                "ownerType": entity_type,
                "ownerId": entity_id,
                "ownerName": "<look up from existing cache>",
            }
        elif r["targetEntityType"] in ("LEGAL_ENTITY", "INDIVIDUAL"):
            expand(r["targetEntityType"], r["targetEntityId"])  # recurse

expand("HOUSEHOLD", household_id)

This expands Household → its individuals and legal entities → each of their outgoing relationships → any sub-LEs they hold → all the way down to every leaf ACCOUNT_FINANCIAL. The number of accounts discovered should match or exceed household.totalAccountCount. If the discovered count is LOWER, flag it as an open question (the household counter may include hard-deleted or orphan accounts).

Account search fallback — some accounts may not be wired into the relationship graph (orphan accounts created directly). Also search by household name tokens AFTER graph traversal to catch these:

GET /api/v1/account-financial/search?searchFor={householdNameToken}&size=100

For each individual in the household:

GET /api/v1/individual/{id}

For each legal entity in the household:

GET /api/v1/legal-entity/{id}

For each account discovered via the traversal above:

GET /api/v1/account-financial/{id}

For each contact in the household:

GET /api/v1/contact/{id}

⚠ Endpoint-choice warning — For TangibleAsset / Liability / InsurancePolicy, two forms exist per entity:

/{entity}/by-individual/{id}, /by-household/{id}, /by-legal-entity/{id} — read the direct FK column only; does NOT traverse entity relationships. An asset owned only via an OWNERSHIP relationship (no FK set) is invisible to these endpoints.
/{entity}/by-owner/{ENTITY_TYPE}/{id} — traverses the entity-relationship graph. Returns a strict superset of the FK-only endpoint.

Availability matrix (verify at run-time via OPTIONS probe or by-individual fallback):

Entity	`/by-owner/{TYPE}/{id}`	`/by-individual`	`/by-household`	`/by-legal-entity`
tangible-asset	✅ (since launch)	—	❌ (use /by-owner/HOUSEHOLD)	—
liability	✅ (PR "backend consistency", 2026-04-23)	✅	✅	✅
insurance-policy	✅ (PR "backend consistency", 2026-04-23)	✅	✅	✅

Always use /by-owner/{TYPE}/{id} for Phase 1 discovery after the 2026-04-23 PR is deployed to your target environment. On older builds (pre-PR), the endpoint returns 404 for liability and insurance-policy — fall back to the three narrower endpoints. Code defensively with an error-shape-aware parser:

def items(resp):
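    # Only treat "status" as an HTTP code when it is an integer; an entity object
    # could carry its own string-valued "status", and str >= int raises TypeError.
    status = resp.get("status") if isinstance(resp, dict) else None
    if isinstance(status, int) and status >= 400:
        raise RuntimeError(f"API error: HTTP {status}: {resp.get('detail', resp)}")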
    if isinstance(resp, list): return resp
    if isinstance(resp, dict) and "content" in resp: return resp["content"]
    return []

The skill's legacy len(resp) / resp.get('content', resp) patterns will silently interpret a 404 JSON body ({"status":404, "detail":..., ...}) as a page of 7 items (the dict's key count); this was observed on a recent production run. Always guard on status >= 400 first.

GET /api/v1/tangible-asset/by-owner/INDIVIDUAL/{individualId}
GET /api/v1/tangible-asset/by-owner/LEGAL_ENTITY/{legalEntityId}
GET /api/v1/tangible-asset/by-owner/HOUSEHOLD/{householdId}

GET /api/v1/liability/by-owner/INDIVIDUAL/{individualId}
GET /api/v1/liability/by-owner/LEGAL_ENTITY/{legalEntityId}
GET /api/v1/liability/by-owner/HOUSEHOLD/{householdId}

GET /api/v1/insurance-policy/by-owner/INDIVIDUAL/{individualId}
GET /api/v1/insurance-policy/by-owner/LEGAL_ENTITY/{legalEntityId}
GET /api/v1/insurance-policy/by-owner/HOUSEHOLD/{householdId}
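The pre-PR fallback described above can be wrapped in one helper. This is a sketch under assumptions: `api_get` is passed in and is assumed to return the parsed JSON body even for error responses (as the 404 example above implies), and `fetch_owned` is a hypothetical name.

```python
FALLBACK_SUFFIX = {
    "INDIVIDUAL": "by-individual",
    "HOUSEHOLD": "by-household",
    "LEGAL_ENTITY": "by-legal-entity",
}

def _http_status(resp):
    """Return the integer HTTP status from an error body, else None."""
    status = resp.get("status") if isinstance(resp, dict) else None
    return status if isinstance(status, int) else None

def fetch_owned(api_get, entity, owner_type, owner_id):
    """Try the graph-traversing endpoint first; fall back to the FK-only one on 404."""
    resp = api_get(f"/api/v1/{entity}/by-owner/{owner_type}/{owner_id}")
    if _http_status(resp) == 404:
        resp = api_get(f"/api/v1/{entity}/{FALLBACK_SUFFIX[owner_type]}/{owner_id}")
    status = _http_status(resp)
    if status is not None and status >= 400:
        raise RuntimeError(f"API error: HTTP {status}")
    if isinstance(resp, list):
        return resp
    return resp.get("content", []) if isinstance(resp, dict) else []
```

Note that results from the fallback path are FK-only, so assets owned purely through an OWNERSHIP relationship will still be missing on older builds.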

Store all of this as the "Altitude Universe" — the complete current state of the household in Altitude. This is the baseline for comparison.

Step 1.4: Search for accounts and contacts by name

Additionally, search for any accounts and contacts by name pattern. Per Rule 67, all entity searches that DO support parentHouseholdId MUST pass it (account search included). Contact search is firm-wide by design.

GET /api/v1/account-financial/search?searchFor={account_name_pattern}&parentHouseholdId={hh_id}&size=50
GET /api/v1/contact/search?searchFor={contact_name_pattern}&size=50  # firm-wide; apply per-result graph filter

Step 1.4b: Rollup health check (Rule 68)

GET /api/v1/household/{id} rollup fields (primaryIndividualName, totalAccountCount, totalMarketValue, totalTangibleAssetValue) may be NULL even when the household has populated entities — they are computed by a nightly job. Build authoritative counts from the per-type list endpoints scoped to parentHouseholdId, NOT from the household rollup:

counts = {
  "individuals":      total(api_get(f"/api/v1/individual?parentHouseholdId={hh_id}&size=1")),
  "legal_entities":   total(api_get(f"/api/v1/legal-entity?parentHouseholdId={hh_id}&size=1")),
  "accounts":         total(api_get(f"/api/v1/account-financial?parentHouseholdId={hh_id}&size=1")),
  "tangible_assets":  total(api_get(f"/api/v1/tangible-asset?parentHouseholdId={hh_id}&size=1")),
  "liabilities":      total(api_get(f"/api/v1/liability?parentHouseholdId={hh_id}&size=1")),
  "insurance":        total(api_get(f"/api/v1/insurance-policy?parentHouseholdId={hh_id}&size=1")),
}

If /household/{id} rollups disagree with the per-type counts, log a "rollup staleness" warning to surface in the Phase 5 review.
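A minimal sketch of that staleness check, limited to the one rollup field that maps directly onto a per-type count (`totalAccountCount` versus `counts["accounts"]`). The function name `rollup_warnings` is hypothetical.

```python
def rollup_warnings(household, counts):
    """Return warning strings where the household rollup disagrees with list counts."""
    warnings = []
    rollup = household.get("totalAccountCount")
    actual = counts.get("accounts", 0)
    if rollup is None and actual > 0:
        # Rollup not yet computed although accounts exist.
        warnings.append(
            f"rollup staleness: totalAccountCount is null, list endpoint reports {actual}")
    elif rollup is not None and rollup != actual:
        warnings.append(
            f"rollup staleness: totalAccountCount={rollup}, list endpoint reports {actual}")
    return warnings
```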

Step 1.5: Lookup-by-prior-UUID pass (only on rerun) — Rule 69

If run_state.json exists from a prior run, before proceeding to Phase 2, GET every UUID in run_state.entities.* with ?scope=ALL_TENANTS and classify each into one of five buckets that map directly to the classification field of stale_prior_uuids.json (see deliverable below):

#	Bucket	Lookup result	`classification` value	Recovery action
(a)	live_in_universe	found AND in current universe	`live` (NOT stale — do not write to `stale_uuids[]`; DO write to `details[]` for audit)	none — expected steady state
(b)	orphan_since_prior_run	found AND NOT in current universe	`orphan` (NOT stale in the deletion sense — write under a separate `orphans[]` key for Phase 6 OWNERSHIP wiring)	re-wire via Phase 6 OWNERSHIP edges
(c)	soft_deleted	returns row with `deleted: true` only when `?includeDeleted=true&scope=ALL_TENANTS`	`soft_deleted`	record in `run_state.softDeletedAwaitingHardDelete[]` per Rule 66; fleet aggregator schedules admin hard-delete
(d)	hard_deleted	404 even with `?scope=ALL_TENANTS&includeDeleted=true`	`hard_deleted`	remove the UUID from `run_state.entities.*` so the next run does not retry the lookup
(e)	no_owner_edges (TangibleAsset, AccountFinancial, Liability)	found via direct GET, has `parentHouseholdId` populated, but `owners[]` is empty OR contains no owner with `ownerType` in {INDIVIDUAL, LEGAL_ENTITY}	`no_owner_edges` (write under `no_owner_edges[]` key — see Rule 79). Exempt: `owners[]` consists solely of `ownerType=HOUSEHOLD` entries (mirrors Rule 80 Exemption C — household-as-economic-owner is a recognized design pattern, not a missing-owner bug).	Phase 6 attempts to identify the correct owner from source documents; on ambiguity, emit a Q-blocker. Note (PLT-77 retraction, 2026-04-29): an earlier version of this rule classified by `individualId IS NULL && legalEntityId IS NULL` — those columns don't exist on the DTO. The corrected criterion uses `owners[]` length and content.
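The bucket (e) criterion can be expressed as a predicate over the entity's `owners[]` array. This is a sketch of the rule as worded in the table; `has_no_owner_edges` is a hypothetical helper name.

```python
REAL_OWNER_TYPES = {"INDIVIDUAL", "LEGAL_ENTITY"}

def has_no_owner_edges(entity):
    """True when the entity belongs in the no_owner_edges bucket."""
    if not entity.get("parentHouseholdId"):
        return False                          # criterion only applies to household-scoped entities
    owners = entity.get("owners") or []
    types = {o.get("ownerType") for o in owners}
    if owners and types == {"HOUSEHOLD"}:
        return False                          # exempt: household-as-economic-owner
    return not (types & REAL_OWNER_TYPES)     # empty, or no INDIVIDUAL/LEGAL_ENTITY owner
```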

If the lookup result is ambiguous (e.g. transient 5xx, network timeout, scope mismatch the skill cannot disambiguate), classify as unknown and surface in the Phase 5 review for the user — do not silently treat as hard-deleted.

If ?scope=ALL_TENANTS returns HTTP 403 (firm-admin API keys cannot use that scope — see Rule 69 numbered-list amendment), fall back to plain GET /api/v1/{resource}/{id} and classify on the fallback response. Only after both calls return a non-200 status should the lookup be marked lookup_failed_403 (NOT hard_deleted).
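Putting the bucket table and the two fallback paragraphs together, the per-UUID decision could be sketched as below. It assumes the caller has already performed the lookup (including the plain-GET fallback on 403) and passes in the final status and body; `classify_lookup` is a hypothetical name, and bucket (e) is evaluated separately from `owners[]`.

```python
def classify_lookup(status, body, in_universe):
    """Map one prior-UUID lookup to a classification value.

    status: HTTP status of the final GET (scope=ALL_TENANTS with
            includeDeleted=true, or the plain-GET fallback), or None if
            the call never completed.
    body:   parsed JSON body for a 200 response, else None.
    """
    if status is None or status >= 500:
        return "unknown"               # transient failure: surface in Phase 5
    if status == 403:
        return "lookup_failed_403"     # both the scoped call and the fallback failed
    if status == 404:
        return "hard_deleted"
    if status == 200 and body is not None:
        if body.get("deleted") is True:
            return "soft_deleted"
        return "live" if in_universe else "orphan"
    return "unknown"                   # anything else is ambiguous, never hard_deleted
```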

This catches prior-run-created entities invisible to the standard graph traversal.

Deliverable — `stale_prior_uuids.json`

The skill MUST write {household_folder}/altitude_review/stale_prior_uuids.json at the end of Step 1.5 — even on a fresh run with zero prior UUIDs and even when zero stale UUIDs are found. The fleet aggregator that runs after the family runs relies on the file's presence to confirm Step 1.5 executed; an absent file is treated as "Step 1.5 was skipped" and triggers a rerun. Same convention as cross_contamination_findings.json (Rule 71) and backend_enum_gaps.json (Rule 72).

Schema — one entry per UUID, bucketed by classification:

stale_uuids[] — buckets (c) soft_deleted and (d) hard_deleted
orphans[] — bucket (b) orphan
no_owner_edges[] — bucket (e) no_owner_edges (Rule 79; TangibleAsset, AccountFinancial, Liability — entities reachable by parentHouseholdId but with no INDIVIDUAL/LEGAL_ENTITY ownership edge in owners[]). Renamed 2026-04-29 after the PLT-77 retraction — the prior name implied a non-existent FK column gap.
details[] — bucket (a) live audit trail (NEW; see Rule 69 amendment). Required even on an all-clean rerun where stale_uuids[], orphans[], and no_owner_edges[] are all empty. Without it, an empty stale_prior_uuids.json cannot distinguish "Step 1.5 ran but found nothing" from "Step 1.5 was silently skipped"; the audit trail in details[] resolves the ambiguity.

{
  "household": "<name>",
  "household_id": "<uuid>",
  "stale_uuids": [
    {
      "uuid": "5ef14ddc-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
      "prior_label": "feedTheFuture",
      "prior_run": "2026-04-15-run-12",
      "lookup_status": "404_with_scope_ALL_TENANTS",
      "classification": "hard_deleted",
      "fleet_aggregator_action": "remove_from_run_state"
    }
  ],
  "orphans": [
    {
      "uuid": "<uuid>",
      "prior_label": "<label from run_state>",
      "prior_run": "<date or run-id>",
      "lookup_status": "200_found_outside_universe",
      "classification": "orphan",
      "fleet_aggregator_action": "schedule_phase6_ownership_wire"
    }
  ],
  "no_owner_edges": [
    {
      "uuid": "<uuid>",
      "type": "TangibleAsset",
      "name": "<asset name>",
      "reason": "owners[] is empty or has no INDIVIDUAL/LEGAL_ENTITY owner",
      "fleet_aggregator_action": "phase6_populate_fk_from_source_docs_or_q_blocker"
    }
  ],
  "details": [
    {
      "uuid": "<uuid>",
      "type": "Individual|LegalEntity|AccountFinancial|TangibleAsset|Liability|InsurancePolicy|Contact|Document",
      "name": "<entity name or basename>",
      "lookup_result": "200_live",
      "as_of_date": "2026-04-28T14:23:00Z"
    }
  ]
}

Empty-result form (still required) — on a fresh run with zero prior UUIDs, details[] is present but empty ([]); on a rerun, details[] MUST contain one entry per verified-live UUID even when all three stale arrays are empty:

{"household": "<name>", "household_id": "<uuid>", "stale_uuids": [], "orphans": [], "no_owner_edges": [], "details": []}

The all-clean rerun shape is stale_uuids: [], orphans: [], no_owner_edges: [], and details: [<one entry per verified-live UUID>] — empty arrays alone are NOT sufficient on a rerun.
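A writer that always emits the file, whatever the buckets contain, might look like this. It is a sketch: `write_stale_report` is a hypothetical name, and it assumes each in-memory entry carries a `classification` key used only for routing into the four arrays.

```python
import json
import os

BUCKET_FOR = {
    "soft_deleted": "stale_uuids",
    "hard_deleted": "stale_uuids",
    "orphan": "orphans",
    "no_owner_edges": "no_owner_edges",
    "live": "details",
}

def write_stale_report(household_folder, household, household_id, entries):
    """Always write stale_prior_uuids.json, even when every bucket is empty."""
    report = {"household": household, "household_id": household_id,
              "stale_uuids": [], "orphans": [], "no_owner_edges": [], "details": []}
    for entry in entries:
        key = BUCKET_FOR.get(entry.get("classification"))
        if key:                       # 'unknown' goes to the Phase 5 review instead
            report[key].append(entry)
    out_dir = os.path.join(household_folder, "altitude_review")
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, "stale_prior_uuids.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(report, fh, indent=2)
    return path
```

Calling it with an empty `entries` list on a fresh run produces the empty-result form shown above.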

prior_label is the key under run_state.entities.* that pointed to the UUID (e.g. legalEntities.feedTheFuture). prior_run is the prior run_state.runId or the prior run_state.completedAt date — whichever the prior file recorded.

Rule 69 — Lookup prior-run UUIDs and emit `stale_prior_uuids.json`

On every rerun, before Phase 2, the skill MUST resolve every UUID in the prior run_state.entities.* against ?scope=ALL_TENANTS (with includeDeleted=true for the soft-delete probe), classify each as live / orphan / soft_deleted / hard_deleted / no_owner_edges / unknown, and write the results to altitude_review/stale_prior_uuids.json under the keys described in the schema above. The file is mandatory: the empty schema must still be written when zero stale or orphan UUIDs are found, so the fleet aggregator can reliably read it. Soft-deleted UUIDs are also added to run_state.softDeletedAwaitingHardDelete[] per Rule 66; hard-deleted UUIDs are removed from run_state.entities.* before Phase 2 begins.

Step 1.5b: Build the Altitude Universe index

Create an in-memory index for matching:

{
  "household": { "id": "...", "name": "...", "firmId": "..." },
  "individuals": [
    {
      "id": "...", "firstName": "...", "lastName": "...", "ssn": "...",
      "dateOfBirth": "...", "email": "...", "phone": "...", "addressLegal": "..."
    }
  ],
  "legal_entities": [
    {
      "id": "...", "legalName": "...", "entityType": "...", "taxId": "...",
      "jurisdiction": "...", "formationDate": "...", "incorporationState": "...", "incorporationCountry": "..."
    }
  ],
  "accounts": [
    {
      "id": "...", "name": "...", "accountNumber": "...", "accountCategory": "...",
      "subCategory": "...", "custodianId": "...", "totalMarketValue": "...",
      "externalIds": [ { "provider": "...", "lastSyncedAt": "..." } ]
    }
  ],
  "contacts": [
    {
      "id": "...", "firstName": "...", "lastName": "...", "email": "...",
      "phone": "...", "jobTitle": "..."
    }
  ],
  "tangible_assets": [
    {
      "id": "...", "name": "...", "category": "...", "assetType": "...",
      "serialOrIdentifier": "...", "currentValue": "..."
    }
  ],
  "liabilities": [
    {
      "id": "...", "name": "...", "liabilityType": "...", "liabilityStatus": "...",
      "lenderName": "...", "accountNumber": "...", "currentBalance": "...",
      "interestRate": "...", "monthlyPayment": "..."
    }
  ],
  "insurance_policies": [
    {
      "id": "...", "name": "...", "policyCategory": "...", "policyNumber": "...",
      "carrierName": "...", "policyStatus": "...", "coverageAmount": "...",
      "annualPremium": "..."
    }
  ],
  "relationships": [ ... ]
}

Save as {household_folder}/altitude_review/altitude_universe.json for reference.

Step 1.6: Externally-synced-account health check (IMMEDIATE ALERT — do not defer to Phase 4.8)

As soon as the universe is built, scan all accounts for broken external syncs BEFORE launching extraction agents. The user needs to see this up-front so they can open a parallel sync-health ticket while the onboarding runs.

# altitude_sync_health.py
broken = []
for acct in universe["accounts"]:
    ext_ids = acct.get("externalIds") or []
    has_provider = any(e.get("provider") for e in ext_ids)
    mv = acct.get("totalMarketValue")
    last_synced = next((e.get("lastSyncedAt") for e in ext_ids if e.get("provider")), None)
    if has_provider and (mv is None or float(mv) == 0 or last_synced is None):
        broken.append({
            "id": acct["id"], "name": acct["name"], "mv": mv,
            "providers": [e.get("provider") for e in ext_ids],
            "lastSyncedAt": last_synced,
        })

if broken:
    print("⚠ WARNING — accounts with broken/zero external sync (do NOT patch these):")
    for b in broken: print(f"  {b}")
    # Persist to altitude_review/addepar_discrepancies_preextraction.json

Rule 42 (externally-synced accounts are read-only) still applies during extraction. The Phase 1 early alert is additive — it surfaces the problem to the operator so they can (a) open a sync-health ticket in parallel, and (b) expect extraction to flag the account as "needs investigation" rather than "broken due to onboarding script."

Phase 2: Scan & Classify Documents

Step 2.0: OneDrive / cloud-sync hydration pre-scan (REQUIRED before spawning agents)

OneDrive, Dropbox, iCloud and Box store files as "dataless placeholders" until accessed — reading one triggers a download. In extraction sub-agents, a cloud-stub read times out after tens of seconds (default socket timeout), wasting compute. Detect unhydrated files before spawning agents and report them to the user for bulk hydration in Finder/Explorer before the expensive extraction runs.

# altitude_hydration_scan.py
#
# IMPORTANT: do NOT use `dd bs=1 count=1` as the stub probe. On OneDrive/macOS
# an attribute lookup for a 1-byte read blocks for seconds even on already-
# hydrated files — a 3-second threshold mis-flags nearly every file as a stub
# (16x false-positive rate observed on recent run).
#
# Correct probe: attempt an actual 4 KB read through Python with a signal
# timeout (POSIX) or a daemon thread join (Windows). Hydrated files return in
# < 10 ms; real cloud stubs either time out or raise OSError 60 ("Operation
# timed out").
import os, platform, signal, sys, threading
from pathlib import Path

HOUSEHOLD = sys.argv[1]              # absolute path to household folder
HYDRATION_TIMEOUT_SECS = 30          # ceiling for first-time hydration on slow links

class _Timeout(Exception): pass

def is_cloud_stub(path: str, timeout_secs: int = HYDRATION_TIMEOUT_SECS) -> bool:
    """True only if the file truly fails to hydrate within `timeout_secs`."""
    result = {"ok": False}

    def _read():
        try:
            with open(path, "rb") as f: f.read(4096)
            result["ok"] = True
        except Exception: pass  # treat any read error as a stub

    if platform.system() == "Windows":
        t = threading.Thread(target=_read, daemon=True); t.start(); t.join(timeout_secs)
        return not result["ok"]
    def _on_alarm(sig, frame): raise _Timeout()
    old = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(timeout_secs)
    try: _read()
    except _Timeout: return True
    finally: signal.alarm(0); signal.signal(signal.SIGALRM, old)
    return not result["ok"]

stubs = []
for root, _, files in os.walk(HOUSEHOLD):
    if "altitude_review" in root: continue
    for f in files:
        if f.startswith(".DS_Store"): continue
        p = os.path.join(root, f)
        if is_cloud_stub(p): stubs.append(p)

if stubs:
    print(f"❌ {len(stubs)} cloud-stub files detected (read times out):")
    for s in stubs: print(f"  {s}")
    print()
    print("TO HYDRATE (macOS Finder):")
    print("  Right-click the folder(s) containing these files → 'Always Keep on This Device'")
    print("TO HYDRATE (Windows Explorer):")
    print("  Right-click → 'Always keep on this device'")
    print("After hydration (green circle icons), re-run the scan.")
    sys.exit(2)
else:
    print(f"✅ All files hydrated. Safe to proceed with extraction.")

If stubs are found: present the list to the user in a compact form (grouped by parent directory, counts), ask them to hydrate, then re-run this scan before proceeding. Do NOT launch extraction agents if any unhydrated files remain — they will consume hundreds of seconds of agent time timing out on reads.
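
The compact presentation described above (grouped by parent directory, with counts) could be produced roughly as follows. This is a minimal sketch: `summarize_stubs` is a hypothetical helper name, and it assumes `stubs` is the list of paths collected by the hydration scan.

```python
import os
from collections import Counter

def summarize_stubs(stubs, household):
    """Return one 'count  relative/dir' line per parent directory, largest group first."""
    counts = Counter(
        os.path.dirname(os.path.relpath(p, household)) or "."
        for p in stubs
    )
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{n:>4}  {d}" for d, n in ordered]

# Example with made-up paths
example = [
    os.path.join("household", "Tax", "2023 return.pdf"),
    os.path.join("household", "Tax", "2022 return.pdf"),
    os.path.join("household", "Deed.pdf"),
]
for line in summarize_stubs(example, "household"):
    print(line)
```

Grouping by directory keeps the message short and points the user at the folders to hydrate in bulk, which is usually faster than hydrating files one by one.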

If all files are hydrated: proceed to the file cache check, then document classification.

Step 2.05: Incremental-Run File Cache (skip-already-seen)

Goal: on a rerun, do not re-extract files whose content hasn't changed since the last successful extraction. Saves substantial time for large households and avoids burning tokens on repeat OCR of 100-page trust agreements.

Maintain a persistent cache at {household_folder}/altitude_review/file_cache.json:

{
  "version": 1,
  "lastRunAt": "2026-04-21T18:40:00Z",
  "files": {
    "Onboarding/Trust Agreement.pdf": {
      "mtime": "2025-05-06T14:22:00Z",
      "size": 2457123,
      "sha256": "a1b2c3...",
      "extractedAt": "2026-04-21T18:05:12Z",
      "cacheLineNumbers": [23],
      "status": "READ"
    }
  }
}

Cache-hit rule — a file can be SKIPPED if and only if ALL three hold AND force is OFF:

The path exists in file_cache.json.
Current mtime and size match the cached values exactly OR sha256 matches.
The cache's cacheLineNumbers references still exist in extraction_cache.jsonl and parse correctly.
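
The third condition can be checked mechanically. The sketch below assumes that `cacheLineNumbers` holds 1-based line numbers into extraction_cache.jsonl and that "parse correctly" means each referenced line is a JSON object; both are assumptions, and `cache_lines_valid` is a hypothetical helper name.

```python
import json

def cache_lines_valid(line_numbers, lines):
    """True if every 1-based line number exists in `lines` and parses as a JSON object."""
    if not line_numbers:
        return False  # nothing to point at: treat as a cache miss
    for n in line_numbers:
        if not (1 <= n <= len(lines)):
            return False
        try:
            obj = json.loads(lines[n - 1])
        except json.JSONDecodeError:
            return False
        if not isinstance(obj, dict):
            return False
    return True

# Example with in-memory lines; in practice `lines` would come from
# open(".../extraction_cache.jsonl", encoding="utf-8").read().split("\n")
sample = ['{"file": "Onboarding/Trust Agreement.pdf", "fileNumber": 1}', "not json"]
print(cache_lines_valid([1], sample), cache_lines_valid([2], sample))
```

A file whose cache lines fail this check should be treated as a cache miss and re-extracted.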

Cloud-sync caveat (OneDrive / Dropbox / iCloud / Box): filesystem mtime can change without file content changing when cloud sync touches the file. Prefer sha256 as the primary cache key when running against a cloud-synced folder. Fall back to mtime+size only when sha256 computation is prohibitively slow.

Force mode — bypasses the cache and re-reads every file, overwriting cache entries with fresh extraction. Supported invocations:

force=true / --force / no-cache=true — bypass cache for all files
force=<glob> — bypass cache for matching paths only (e.g. force=Tax/**/*.pdf)

Use force mode when:

The extraction logic has changed (new entity types, new rules, new checklist items)
The skill has been updated and you want to re-run with the new prompts
You suspect prior extraction missed data (OCR was incomplete)
The user explicitly asks to re-extract or reprocess

Default: force=false. Log each file as SKIPPED (cache hit) or READ (force=true, cache bypassed) in the tracker.

Orchestrator snippet (run before spawning extraction agents):

Write file_cache_scan.py and run {PYTHON} file_cache_scan.py (Cross-Platform Setup):

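# file_cache_scan.py
# Usage: {PYTHON} file_cache_scan.py <household_folder> [--force | force=true | no-cache=true] [force=<glob> ...]
import fnmatch, hashlib, json, os, pathlib, sys
from datetime import datetime, timezone

household_folder = sys.argv[1]
FORCE_ALL = {"true", "--force", "force=true", "no-cache", "no-cache=true"}
FORCE_OFF = {"false", "force=false"}
args = sys.argv[2:]
force = any(a in FORCE_ALL for a in args)
# Any other argument is a glob pattern, given bare or as force=<glob>
force_paths = [a[len("force="):] if a.startswith("force=") else a
               for a in args if a not in FORCE_ALL | FORCE_OFF]

def sha256_file(p, chunk=1024*1024):
    h = hashlib.sha256()
    with open(p, "rb") as f:
        for b in iter(lambda: f.read(chunk), b""): h.update(b)
    return h.hexdigest()

def iso_utc(ts):
    # Second-precision UTC with a Z suffix, the same format as "mtime" in file_cache.json
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

review_dir = pathlib.Path(household_folder) / "altitude_review"
review_dir.mkdir(exist_ok=True)
cache_path = review_dir / "file_cache.json"
cache = {"version": 1, "files": {}}
if cache_path.exists():
    cache = json.loads(cache_path.read_text(encoding="utf-8"))

to_process, to_skip = [], []
for root, _, files in os.walk(household_folder):
    if "altitude_review" in root: continue
    for fn in files:
        if fn.startswith(".DS_Store"): continue
        full = os.path.join(root, fn)
        rel = os.path.relpath(full, household_folder).replace(os.sep, "/")
        st = os.stat(full)
        entry = cache["files"].get(rel)
        # fnmatch lets "*" span "/", so force=Tax/**/*.pdf also matches deeply nested files
        force_this = force or any(fnmatch.fnmatch(rel, p) for p in force_paths)
        # Cache-hit rule 2: (mtime AND size match) OR sha256 matches. The cheap stat
        # comparison runs first; the file is hashed only when that comparison fails.
        # Rule 3 (cacheLineNumbers still resolve) is not checked by this script.
        if (not force_this and entry and (
                (entry.get("size") == st.st_size and entry.get("mtime") == iso_utc(st.st_mtime))
                or (entry.get("sha256") and entry["sha256"] == sha256_file(full)))):
            to_skip.append(rel)
        else:
            to_process.append(rel)

print(json.dumps({"process": to_process, "skip": to_skip, "force": force}))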

Pass process to the extraction agents. Pre-populate file_tracker.md with one row per skip file: | N | path | SKIPPED (cache hit) | (see extraction_cache line K) |. After extraction agents complete, update file_cache.json with the new mtime/sha256/extractedAt for each processed file.
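
The post-extraction cache update could look roughly like this. It is a sketch under stated assumptions: `build_cache_entry` is a hypothetical helper, the field names follow the file_cache.json example above, and timestamps are written in the second-precision `Z` format shown in that example.

```python
import hashlib
import os
from datetime import datetime, timezone

def iso_utc(ts=None):
    """UTC timestamp like 2026-04-21T18:05:12Z (now, if ts is None)."""
    dt = datetime.now(timezone.utc) if ts is None else datetime.fromtimestamp(ts, tz=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")

def build_cache_entry(full_path, cache_line_numbers, status="READ"):
    """Build one file_cache.json entry for a file that was just extracted."""
    st = os.stat(full_path)
    h = hashlib.sha256()
    with open(full_path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return {
        "mtime": iso_utc(st.st_mtime),
        "size": st.st_size,
        "sha256": h.hexdigest(),
        "extractedAt": iso_utc(),
        "cacheLineNumbers": list(cache_line_numbers),
        "status": status,
    }
```

The orchestrator would then set `cache["files"][rel] = build_cache_entry(full, [line_no])` for each processed file, refresh `lastRunAt`, and write the JSON back to altitude_review/file_cache.json.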

Step 2.1: Classify Documents

List all files recursively in the household folder. Classify each document using the patterns in references/document_type_patterns.md. Key classification rules:

Extraction priority:

Tier 1 (extract first): Onboarding sheets, IDs (DLs, passports), trust agreements, LLC operating agreements, articles of organization/incorporation, EIN letters, account applications
Tier 1.5 (NEVER SKIP — compact, data-dense IRS forms): Form 1098 (mortgage → creates liability + tangible asset), Schedule K-1 (ownership percentages — authoritative), Form 1099-DIV/INT/B/R (account + custodian validation), Form W-2 (employer + income), Form 5498 (IRA details). A single 1098 produces 1 liability, 1 tangible asset, and 2+ relationships. A single K-1 produces an entity + ownership relationship with exact percentage.
Tier 2: Personal tax returns (1040), entity tax returns (1065/1120/1120S/1041), account statements, insurance policy declarations, property tax bills, beneficiary designations
Tier 3: Financial statements, meeting notes, presentations, valuations, estate planning flowcharts
Tier 4 (skip): Duplicates ("Copy of", "zDupes"), receipts, .msg files, spreadsheets with personal notes, generic LLM/schema templates with no populated family data (see Generic-Template Detection heuristic in references/document_type_patterns.md)

Document-to-entity association — each document maps to an entity type for upload: Read references/document_entity_association.md for the complete mapping of which document types associate with which Altitude entity type and what documentSubType to use.

Phase 3: Extract Entities from Documents

PDF Reading — TEXT-FIRST by default

⚠ CRITICAL: Do NOT start with Claude's Read tool on PDFs. Claude's Read tool rejects images with any dimension >2000px, and many scanned PDFs (trust documents, deeds, handwritten notes, high-res scans) include pages that trip this limit. The result is "image exceeds 2000px dimension limit" errors that abort whole extraction batches.

Required reading order for every PDF:

Text-first via pdftotext (poppler) — works on any PDF with embedded text:
```
pdftotext -layout -nopgbrk "file.pdf" - | head -c 200000 > /tmp/extracted.txt
```
Then Read /tmp/extracted.txt. This is fast, safe, and avoids the image limit entirely.
If pdftotext returns mostly blank or gibberish → the PDF is a scan. Render each page to a PNG first:
```
# Render pages 1-5 at 200 dpi, cap long side via -scale-to 1800 so Claude can read
pdftoppm -r 200 -scale-to 1800 -f 1 -l 5 "file.pdf" /tmp/scan_page -png
```
Then choose the OCR path based on content type:

2a. Handwritten content (meeting notes, signed statements, margin annotations) → use Claude's Read tool on each PNG directly. Claude vision handles cursive, mixed-case, and arrows/margin marks fluently; tesseract does not. In practice tesseract on handwriting returns word-salad ("hy borotvw , 2p 7. vf shaves") while Claude transcribes the same page accurately. Skip tesseract entirely for handwriting.

2b. Typeset scanned text (faxed letters, older trust documents printed and re-scanned, filings) → tesseract is reliable and much cheaper than vision:
```
for png in /tmp/scan_page-*.png; do
  tesseract "$png" "${png%.png}" -l eng --psm 6
done
cat /tmp/scan_page-*.txt > /tmp/extracted.txt
```
Then Read /tmp/extracted.txt. If tesseract output looks garbled (low confidence, nonsense words, missing punctuation), fall back to Claude Read on the PNGs.
Only fall back to Claude's Read tool on the raw PDF as a last resort — and only if the file is < 5 MB (to avoid loading many high-res pages). If Read fails with the 2000px error, mark the file status=FAILED_IMAGE_TOO_LARGE in the tracker and move on. Do not loop.
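
The "mostly blank or gibberish" decision in the reading order above can be approximated with a small heuristic. This is only a sketch: `looks_like_scan` and its thresholds are hypothetical and would need tuning on real documents. It catches blank output and symbol noise, but not plausible-looking word-salad, which still needs a human or model glance.

```python
import re

# A token counts as "word-like" if it is an alphabetic word (optionally followed
# by punctuation) or a number/amount such as 1,250.00 or 12/31/2024.
_WORDLIKE = re.compile(r"[A-Za-z]{2,}[.,;:]?|\$?[\d,./%-]+")

def looks_like_scan(text, min_words=20, min_wordlike_ratio=0.6):
    """True when extracted text is too sparse or too noisy to trust."""
    tokens = text.split()
    if len(tokens) < min_words:
        return True  # mostly blank: likely an image-only PDF
    wordlike = sum(1 for t in tokens if _WORDLIKE.fullmatch(t))
    return wordlike / len(tokens) < min_wordlike_ratio

print(looks_like_scan(""))
print(looks_like_scan("The trustee shall distribute income to the beneficiaries annually. " * 5))
```

When the function returns True for the pdftotext output, the file would go down the render-to-PNG path instead of the text path.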

Use pypdf for page-index scanning (Pass 1 of the Large PDF Strategy below). This is still the best way to find the data-rich pages in a 200-page tax return without loading every page's content.

Large PDF Strategy (20+ pages)

Tax returns and combined statements are often 50-200+ pages. Reading only the first few pages will miss K-1 summaries, W-2s, 1099s, Schedule H, and passthrough entity details buried deep in the document. Use this two-pass strategy, built on the text-first foundation above:

Pass 1 — Page Index Scan (fast, text-only):

Use PYTHON from the Cross-Platform Setup section for all Python invocations. On Windows, write multi-line scripts to a temp .py file instead of using -c to avoid shell quoting issues.

# page_scan.py: write this to a temp file, then run: {PYTHON} page_scan.py "file.pdf"
import sys
from pypdf import PdfReader
reader = PdfReader(sys.argv[1])
print(f'Total pages: {len(reader.pages)}')
for i, page in enumerate(reader.pages):
    text = (page.extract_text() or '')[:150].replace('\n', ' | ')
    print(f'  Page {i+1}: {text}')

Run: {PYTHON} page_scan.py "file.pdf" (where {PYTHON} is python on Windows and python3 on macOS/Linux, per Cross-Platform Setup above).

This produces a one-line summary per page. Scan the output for keywords that signal data-rich pages:

| Keyword | What It Signals | Action |
|---|---|---|
| `K-1`, `Schedule K-1` | Partnership/LLC ownership + income | Read full page (TAX_K1 checklist) |
| `W-2`, `Wage and Tax` | Employer name, wages, SSN | Read full page (TAX_W2 checklist) |
| `1099`, `1099-DIV`, `1099-INT`, `1099-B`, `1099-R` | Account/custodian validation, income | Read full page (TAX_1099 checklist) |
| `1098`, `Mortgage Interest` | Mortgage lender, balance, property | Read full page (TAX_1098 checklist) |
| `Schedule E`, `Passthrough` | Entity names, EINs, income types | Read full page |
| `Schedule H`, `Household Employ` | Domestic staff, household employer EIN | Read full page |
| `Schedule A`, `Itemized` | Mortgage interest, charitable, taxes | Skim for amounts |
| `Schedule C`, `Profit or Loss` | Sole proprietorship business | Read full page |
| `Sign Here`, `Occupation`, `Preparer` | Occupations, CPA name/phone | Read full page (usually page 2 of 1040) |
| `8879`, `e-file` | SSNs, AGI confirmation, preparer | Read full page |
| Entity names (trust names, LLC names) | Entity K-1 details | Read full page |
| `LESSER`, `TRUST`, or any family surname | Related trust/entity income | Read full page |
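
The keyword scan over the Pass 1 output can be automated along these lines. This is a sketch, not the skill's own implementation: `flag_pages` is a hypothetical helper, the `READ_FULL` and `SKIM` labels are made up for illustration, and entity names and family surnames would have to be added per household.

```python
import re

# (pattern, label) pairs derived from the keyword table; TAX_* labels name the
# checklists mentioned there, READ_FULL and SKIM are illustrative labels.
PAGE_SIGNALS = [
    (r"\bK-1\b", "TAX_K1"),
    (r"\bW-2\b|Wage and Tax", "TAX_W2"),
    (r"\b1099(-(DIV|INT|B|R))?\b", "TAX_1099"),
    (r"\b1098\b|Mortgage Interest", "TAX_1098"),
    (r"Schedule [ECH]\b|Passthrough|Household Employ|Profit or Loss", "READ_FULL"),
    (r"Sign Here|Occupation|Preparer|\b8879\b|e-file", "READ_FULL"),
    (r"Schedule A\b|Itemized", "SKIM"),
]

def flag_pages(page_summaries):
    """Map 1-based page numbers to the labels whose keywords appear on that page."""
    flagged = {}
    for page_no, text in enumerate(page_summaries, 1):
        labels = []
        for pattern, label in PAGE_SIGNALS:
            if label not in labels and re.search(pattern, text, re.IGNORECASE):
                labels.append(label)
        if labels:
            flagged[page_no] = labels
    return flagged

pages = [
    "Form 1040 U.S. Individual Income Tax Return",
    "Schedule K-1 (Form 1065) Partner's Share",
    "Form 1098 Mortgage Interest Statement",
]
print(flag_pages(pages))
```

The flagged page numbers feed directly into the targeted deep read; pages with no label are left unread unless they fall under the minimum-pages list.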

Pass 2 — Targeted Deep Read: For each flagged page, extract text with pdftotext -f N -l N "file.pdf" (where N is the page number) to a temp file and Read that. Only use Claude's Read tool on the original PDF for the flagged pages if pdftotext returns empty for that specific page (indicating a scanned page). For a typical 200-page return, you'll usually need to read 15-25 key pages.

Minimum pages to ALWAYS read from a personal 1040 return:

Cover letter (page 1) — preparer firm, client address
Form 8879 — SSNs, AGI, preparer name
Form 1040 pages 1-2 — income summary, dependents, occupations, preparer, filing status
Schedule E page 2 — ALL passthrough entity names + EINs
Passthrough income detail pages — entity-by-entity breakdown
Schedule H (if present) — household employment
Any pages with K-1, W-2, 1098, or 1099 keywords

Password-protected PDFs: Tax returns are often password-protected. The password is frequently in the filename (e.g., "pass 701431"). Decrypt before reading:

qpdf --password=PASSWORD --decrypt input.pdf decrypted.pdf

Write the decrypted file to the same directory as the input, or to a temp directory (use Python tempfile.mkdtemp() if needed — do NOT hardcode /tmp/).
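
A sketch of how the password lookup and the qpdf call might be wired together. Both helpers are hypothetical, and the filename pattern is an assumption based on the single "pass 701431" example: it only accepts a token containing at least one digit after `pass` or `password`, so any candidate still has to be confirmed by a successful decrypt.

```python
import re
import tempfile
from pathlib import Path

def password_from_filename(name):
    """Return a candidate password embedded in a filename, or None."""
    m = re.search(r"\bpass(?:word)?[\s:_-]+([A-Za-z0-9]*\d[A-Za-z0-9]*)", name, re.IGNORECASE)
    return m.group(1) if m else None

def qpdf_decrypt_command(src, password, out_dir=None):
    """Build the qpdf argument list; output goes to a fresh temp dir unless out_dir is given."""
    out_dir = Path(out_dir) if out_dir else Path(tempfile.mkdtemp(prefix="altitude_decrypt_"))
    dst = out_dir / (Path(src).stem + ".decrypted.pdf")
    cmd = ["qpdf", f"--password={password}", "--decrypt", str(src), str(dst)]
    return cmd, dst

print(password_from_filename("2023 Return pass 701431.pdf"))
```

Run the returned list with `subprocess.run(cmd, check=True)`; passing the arguments as a list avoids shell quoting problems with spaces in household file names.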

If qpdf is not installed:

macOS: brew install qpdf
Windows: choco install qpdf or winget install qpdf or scoop install qpdf
Linux: apt install qpdf or dnf install qpdf

⛔ CRITICAL: Zero-Skip Rule — THE #1 CAUSE OF EXTRACTION FAILURE

EVERY file in the household folder MUST be opened and read. NO EXCEPTIONS.

This is the single most important rule in this skill. In testing, 100% of extraction failures trace back to files that were not read. Not "low quality" files. Not "redundant" files. Files that were simply never opened. An Operating Agreement that contains ownership percentages. A 1099 that reveals an account number. A DocuSign certificate that identifies the employer. An email signature with an attorney's contact info.

You WILL be tempted to skip files. You will think "I already have the EIN from the onboarding sheet, I don't need to read the EIN letter." You will think "The amendments just change the address, I already know the address." You will think "The Sunbiz is just a state filing." Every one of these thoughts leads to missed data. Every document contains something — a name, an address, a date, a registered agent, a formation date — that cannot be found anywhere else.

Do not classify files as low priority and skip them. Do not read 22 out of 60 files and call it done. Read ALL 60. If the context window gets full, save your extraction progress to disk and continue in a follow-up pass.

For each file, at minimum:

PDFs: Read at least page 1. If it's a multi-page form (tax return, statement), use the Large PDF Strategy above to find all data-rich pages.
Images (.jpg, .png): Read with Claude's vision. Even a property photo confirms a real asset exists.
Word docs (.docx): Convert using the platform's DOCX_CMD (see Cross-Platform Setup). Fallback chain: textutil (macOS) → pandoc (cross-platform) → python-docx (write a docx_read.py script — see Standard Document Extraction below for the exact script). If all fail, flag for user — don't silently skip.
Emails (.eml): Parse headers + body. Extract attachments and process them too.

Enforce with a tracking file: After Phase 2 classification, write a file checklist to altitude_review/file_tracker.md with every file path. As you read each file, update the tracker with its status and a one-line summary of what was extracted. Before Phase 4, parse the tracker and verify that no row is still PENDING: every file must be READ, or SKIPPED (cache hit) under the Step 2.05 cache-hit rule. If any files remain unread, you MUST read them before proceeding. This is not optional.

Example tracker format:

| # | File | Status | Extracted |
|---|------|--------|-----------|
| 1 | Identification/DL.png | READ | Client B, Spouse B, DOBs, addresses |
| 2 | LLC/Operating Agreement.pdf | READ | Members: Client B 60%, Spouse B 40% |
| 3 | Tax/1099-INT.pdf | PENDING | |
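
The pre-Phase 4 tracker check can be scripted. A minimal sketch, assuming the four-column table format shown above; `unread_files` is a hypothetical helper, and `done_statuses` is a parameter so an incremental run can also accept `SKIPPED (cache hit)` rows.

```python
def unread_files(tracker_markdown, done_statuses=("READ",)):
    """Return (path, status) for every tracker row whose status is not considered done."""
    pending = []
    for line in tracker_markdown.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) < 3 or not cells[0].isdigit():
            continue  # header row, separator row, or non-table text
        path, status = cells[1], cells[2]
        if not status.startswith(tuple(done_statuses)):
            pending.append((path, status))
    return pending

tracker = """| # | File | Status | Extracted |
|---|------|--------|-----------|
| 1 | Identification/DL.png | READ | Client B, Spouse B, DOBs, addresses |
| 3 | Tax/1099-INT.pdf | PENDING | |
"""
print(unread_files(tracker))
```

An empty result means the gate is satisfied; anything else is the list of files that must still be read.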

For folders with 10+ files, use the Parallel Extraction Strategy (see Workflow Overview). Spawn one Agent per batch. Each agent handles its assigned files independently and writes results to its own extraction_cache_batch_{N}.jsonl file. The orchestrator merges after all agents complete. For folders with < 10 files, a single agent handles all files.

Extraction Cache (REQUIRED — each agent writes its own)

After reading EACH file, each extraction agent appends what it learned to its own cache file: altitude_review/extraction_cache_batch_{N}.jsonl (one JSON object per line, append-only). The orchestrator later merges all batch files into altitude_review/extraction_cache.jsonl.
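
The orchestrator's merge step might look like the sketch below. It assumes the batch files have already passed validation, orders records by `fileNumber`, and appends to extraction_cache.jsonl so that existing line numbers stay stable. The ordering and the append behaviour are assumptions, and `merge_batches` is a hypothetical helper.

```python
import json
from pathlib import Path

def merge_batches(review_dir):
    """Append all batch records to extraction_cache.jsonl; return the number merged."""
    review_dir = Path(review_dir)
    records = []
    for batch in sorted(review_dir.glob("extraction_cache_batch_*.jsonl")):
        with open(batch, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    records.append(json.loads(line))
    records.sort(key=lambda r: r["fileNumber"])
    merged = review_dir / "extraction_cache.jsonl"
    with open(merged, "a", encoding="utf-8") as f:  # append-only
        for r in records:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
    return len(records)
```

Re-serialising each record with `json.dumps` guarantees one object per line in the merged file even if a batch file contained stray whitespace.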

⛔ STRICT SCHEMA RULE — ONE JSON OBJECT PER LINE, ONE LINE PER FILE

Every line in the JSONL file MUST be a single valid JSON object with at minimum these required top-level keys: file, fileNumber, readAt, entities.

Forbidden:

Splitting one file's data across multiple lines (no "Dan as Individual on line 5, Dan's trust as LegalEntity on line 6"). All entities/relationships/contacts extracted from a single file belong on the SAME line as nested arrays under entities.
Concatenating multiple JSON objects on one line without a newline between them.
Pretty-printed multi-line JSON.
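
One way for an extraction agent to stay inside these rules is to build the whole record for a file first and serialise it in a single call. A minimal sketch; `to_jsonl_line` and `append_record` are hypothetical helpers, and the required keys are the ones listed above.

```python
import json

REQUIRED_KEYS = ("file", "fileNumber", "readAt", "entities")

def to_jsonl_line(record):
    """Serialise one file's record as exactly one line of JSON."""
    missing = [k for k in REQUIRED_KEYS if k not in record]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    if not isinstance(record["entities"], dict):
        raise ValueError("'entities' must be a dict")
    # Without indent, json.dumps emits no raw newlines; newlines inside
    # string values are escaped as \n, so the record stays on one line.
    return json.dumps(record, ensure_ascii=False) + "\n"

def append_record(path, record):
    """Append one record to a batch cache file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(to_jsonl_line(record))
```

Calling `append_record` once per file, after all of that file's entities, relationships, and contacts are collected, avoids both split records and concatenated objects.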

Orchestrator MUST validate each batch file after the extraction agent completes, before merging. If validation fails, respawn a repair agent for that batch with stricter prompts.

# validate_jsonl.py — run after each batch completes, before merge
import json, pathlib, sys
def validate_jsonl(path):
    errors = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            line = line.strip()
            if not line: continue
            try:
                obj = json.loads(line)
            except Exception as e:
                errors.append(f"Line {i}: invalid JSON ({e})")
                continue
            for key in ("file", "fileNumber", "readAt", "entities"):
                if key not in obj:
                    errors.append(f"Line {i}: missing required key '{key}'")
            if "entities" in obj and not isinstance(obj["entities"], dict):
                errors.append(f"Line {i}: 'entities' must be a dict")
    return errors

for batch in sorted(pathlib.Path(sys.argv[1]).glob("extraction_cache_batch_*.jsonl")):
    errs = validate_jsonl(batch)
    print(f"{'FAIL' if errs else 'OK'} {batch.name}: {errs[:5] if errs else ''}")

Each line captures everything extracted from a single file:

{"file": "Identification/DL.png", "readAt": "2026-03-19T22:00:00Z", "fileNumber": 1, "entities": {"individuals": [{"name": "Client B", "dob": "1985-06-01", "gender": "M", "dlNumber": "XXXXXXXXX", "dlState": "FL", "dlExpiry": "2031-06-01", "address": "123 Main Street, City, ST 00000"}]}, "relationships": [], "contacts": [], "accounts": [], "notes": "Both Client B and Spouse B DLs on same image"}
{"file": "LLC/OperatingLLC/Operating Agreement.pdf", "readAt": "2026-03-19T22:01:00Z", "fileNumber": 2, "entities": {"legalEntities": [{"name": "Operating LLC X1", "type": "LLC", "managementType": "MEMBER_MANAGED", "opAgreementDate": "2022-09-29"}]}, "relationships": [{"source": "Client B", "target": "Operating LLC X1", "type": "OWNERSHIP", "percentage": 50, "role": "Managing Member"}], "contacts": [{"name": "Registered Agent Name", "role": "Registered Agent", "address": "456 Agent Way, City, ST 00000"}], "accounts": [], "notes": "Principal: 123 Main Street"}

Why this matters:

Resumability: If context resets at file 150 of 292, the next session reads the cache and picks up at file 151 — no re-reading of the first 150 files
Cross-document validation: Later files can check against earlier extractions ("is this the same trust?") without re-reading the source documents
Subagent handoff: One agent extracts (write
