paperless-canonical-ingestion

star 12

Canonical ingestion workflow for family-office records: discover candidate files in Google Drive, Gmail attachments, OneDrive, and SharePoint; deduplicate minor variants; choose one canonical document; and upsert into Paperless with normalized title, document type, correspondent, and owner-routing tags. Use when seeding Paperless, running periodic syncs, or cleaning duplicates across storage systems.

hvkshetry

By hvkshetry schedule Updated 3/13/2026

play_arrow Run Skill in Manus View GitHub

name: paperless-canonical-ingestion description: | Canonical ingestion workflow for family-office records: discover candidate files in Google Drive, Gmail attachments, OneDrive, and SharePoint; deduplicate minor variants; choose one canonical document; and upsert into Paperless with normalized title, document type, correspondent, and owner-routing tags. Use when seeding Paperless, running periodic syncs, or cleaning duplicates across storage systems.

Paperless Canonical Ingestion

Outcome

Create and maintain a Paperless corpus that is canonical, deduplicated, and queryable by agent role.

Source Discovery (read-only)

Google Drive: search_drive_files, list_drive_items, get_drive_file_download_url
Gmail: search_gmail_messages, get_gmail_message_content, get_gmail_attachment_content
Microsoft: office-mcp.files with search, get, download
Paperless: search_documents, get_document, post_document, update_document, bulk_edit_documents

Workflow

Define scope and query set.

Start from explicit folders, entities, and case names in the request.
Expand with keyword variants (tax years, policy numbers, case IDs, entity aliases).

Collect candidate files.

Pull metadata first: filename, path/link, modified date, sender/issuer, mime type.
Download only shortlisted files needed for canonical comparison or ingest.

Canonicalize duplicates and near-duplicates.

Use normalized filename stem + issuer + date + page count + first-page text similarity.
Treat variants as duplicates when differences are forwarding wrappers, scan quality, or cosmetic edits.
Keep one canonical file per logical document set.
Record why non-canonical copies were skipped.

Rank canonical candidates (highest wins).

executed/signed/final > draft.
authority-issued original > forwarded attachment.
complete packet > partial extract.
machine-readable PDF > low-quality image scan (unless scan is the only executed copy).
newest corrected version > older versions (unless chronology itself is legally important).

Upsert into Paperless.

If a logical document already exists: use update_document for metadata normalization.
If not present: ingest with post_document.
Normalize title, document_type, correspondent, and tags using references/filing-matrix.md.

Apply ownership routing.

Add owner tags so downstream task routing is deterministic.
Use these role tags: owner-estate, owner-hc, owner-io, owner-cos, owner-hd, owner-wellness.
Use multiple owner tags when the document has cross-functional impact.

Verify and report.

Re-query by entity/case tags and spot-check duplicates.
Return summary: uploaded, updated, skipped_duplicate, needs_user_decision.

Canonical Baseline Seed Pack

When seeding family-office context, prioritize these categories:

Estate core: will, irrevocable trust, POA, healthcare directives.
Tax core: latest 3 years personal/entity returns, notices/orders, appeals, payment proofs.
Entity core: formation docs, operating agreements, annual reports, key minutes/resolutions.
Property core: deed/title, mortgage/loan docs, insurance declarations, major service contracts.
Risk core: active policies, material liabilities, active litigation/case packets.

Guardrails

Do not perform Microsoft write operations.
Do not delete source files from Drive/OneDrive/SharePoint/Gmail.
Do not ingest both canonical and clearly duplicate variants.
Prefer update_document over re-upload when Paperless already has the canonical file.

Completion Checklist

Canonical selected for each logical document set.
Paperless metadata normalized (title, document_type, correspondent, tags).
Owner-routing tags applied for all seeded records.
Duplicate decisions documented in the final summary.
Ambiguities surfaced explicitly for user decision.

References

Filing and owner-routing matrix: references/filing-matrix.md

Install via CLI

npx skills add https://github.com/hvkshetry/StewardOS --skill paperless-canonical-ingestion

Repository Details

star Stars 12

call_split Forks 6

navigation Branch main

article Path SKILL.md

Occupations

Office Clerks General 439061

More from Creator

hvkshetry

hvkshetry Explore all skills →