name: scrub-meta
description: Strip identifying metadata from any file before sharing externally - office docs, PDFs, images, video, audio, archives. Removes creator, paths, embeds, EXIF, GPS, hyperlinks, normalizes timestamps, verifies clean. Use when sending a file to a client or external party where you must not leak the source (path, user, partner names, software fingerprint, location data, embedded supplier links).
user_invocable: true
Scrub Meta
Sanitize a file before sending it externally. Removes everything that fingerprints the originator: filesystem paths, author names, software signatures, embedded media, hyperlinks, EXIF/GPS, document GUIDs, archive timestamps. Verifies the output is clean against caller-supplied and standing leak terms.
Always writes to a fresh copy. The original is never modified.
Always invoke this skill; never roll a manual scrub. A manual one-off scrub will reliably miss at least one leak vector. The most common silent miss is <x15ac:absPath> inside xl/workbook.xml of any Office-for-Mac xlsx: invisible in the Excel UI, but one unzip -p away from leaking the original full path, including username and folder names. The script's OOXML branch rebuilds the file from scratch via openpyxl, which structurally cannot emit absPath, and the verify step greps xl/workbook.xml for standing leak terms (see ~/Documents/GitHub/claude/.claude/skills/personal/scrub-meta/leak-terms.txt) as a backstop. Regression test: tests/test_xlsx_abspath.py.
Flow
- Resolve input
- Detect type via file --mime-type
- Pre-scan with exiftool (snapshot)
- Type-specific scrub (OOXML / PDF / image / video / audio / archive / iWork / legacy-Office / SVG / text / unknown)
- Universal pass (exiftool fallback + xattr strip + delete ._* resource forks)
- Verify (exiftool diff + grep leak terms + sha256 round-trip)
- Print markdown report
- Reveal output in Finder
- Print warning: do not open in native app before sending
Invocation
python3 ~/Documents/GitHub/claude/.claude/skills/scrub-meta/scripts/scrub.py <file> [--out PATH] [--terms term1,term2,...]
Defaults:
- --out → ~/Downloads/<stem>_clean<ext>
- --terms is merged with ~/Documents/GitHub/claude/.claude/skills/personal/scrub-meta/leak-terms.txt (gitignored standing list)
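The defaults above can be sketched roughly as follows. This is a minimal illustration, not the actual scrub.py: the helper names (build_parser, merge_terms, default_out) are hypothetical, and the leak-terms file format (one term per line, # for comments) is an assumption.

```python
import argparse
from pathlib import Path

# Path of the standing list as given in the text above.
STANDING_TERMS = (
    Path.home()
    / "Documents/GitHub/claude/.claude/skills/personal/scrub-meta/leak-terms.txt"
)


def build_parser() -> argparse.ArgumentParser:
    """CLI shape taken from the Invocation line: <file> [--out PATH] [--terms a,b]."""
    parser = argparse.ArgumentParser(prog="scrub.py")
    parser.add_argument("file", type=Path)
    parser.add_argument("--out", type=Path, default=None)
    parser.add_argument("--terms", default="")
    return parser


def merge_terms(cli_terms: str, standing_file: Path) -> list[str]:
    """Combine --terms with the standing list, de-duplicated case-insensitively."""
    terms = [t.strip() for t in cli_terms.split(",")]
    if standing_file.is_file():
        # Assumed format: one term per line, '#' starts a comment.
        for line in standing_file.read_text(encoding="utf-8").splitlines():
            terms.append(line.split("#", 1)[0].strip())
    seen, merged = set(), []
    for term in terms:
        if term and term.lower() not in seen:
            seen.add(term.lower())
            merged.append(term)
    return merged


def default_out(src: Path) -> Path:
    """Default output path when --out is not given (before filename sanitizing)."""
    return Path.home() / "Downloads" / f"{src.stem}_clean{src.suffix}"
```

A caller would pass STANDING_TERMS as the second argument of merge_terms; a missing file simply contributes no terms.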
Step 1: Resolve input
Accept the file path as the first positional arg. Resolve to absolute. Fail loudly if missing.
Execution: Direct
Verify: os.path.isfile(path)
Step 1b: Sanitize filename
The filename is a leak vector too — partner names, [External] markers, upload-session IDs, and copy suffixes all travel inside it and survive every content scrub. When --out is not provided, sanitize_filename() strips:
- Standing leak terms from personal/scrub-meta/leak-terms.txt (whole-token, case-insensitive) and any --terms extras
- Sensitivity markers in brackets or parens: [External], (Internal), [Confidential], [Draft], [NDA], [Client Copy]
- Leading upload-session IDs: 10+ digit prefix followed by _ / - / space (e.g. 1778551079623_…)
- Trailing copy markers: (1), (2) from re-downloads
- Runs of whitespace / separators (collapsed), with edges trimmed
Output goes to ~/Downloads/<sanitized>_clean.<ext>. When the rename triggers, the report surfaces a renamed: note with the original filename so the user sees what was stripped. If the caller passes --out explicitly, that wins — no auto-rename. Fallback stem document is used if sanitization empties the string.
Execution: Direct
Verify: report shows renamed: note when stem differs from <src.stem>_clean
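A rough sketch of the stripping rules listed above. The real sanitize_filename() may differ; sanitize_stem and the regexes here are assumptions written only to illustrate the rule order and the whole-token matching.

```python
import re

# Sensitivity markers in brackets or parens, e.g. [External] or (Internal).
MARKERS = re.compile(
    r"[\[(]\s*(external|internal|confidential|draft|nda|client copy)\s*[\])]",
    re.IGNORECASE,
)
SESSION_ID = re.compile(r"^\d{10,}[_\- ]")   # leading upload-session ID
COPY_SUFFIX = re.compile(r"\s*\(\d+\)$")     # trailing (1), (2) from re-downloads


def sanitize_stem(stem: str, terms: list[str]) -> str:
    """Apply the filename rules to a stem (no extension); hypothetical helper."""
    stem = MARKERS.sub(" ", stem)
    stem = SESSION_ID.sub("", stem.strip())
    stem = COPY_SUFFIX.sub("", stem.strip())
    for term in terms:
        # Whole-token match: the term may not touch a letter or digit on either side.
        pattern = r"(?<![A-Za-z0-9])" + re.escape(term) + r"(?![A-Za-z0-9])"
        stem = re.sub(pattern, " ", stem, flags=re.IGNORECASE)
    # Collapse each run of separators to its first character, then trim the edges.
    stem = re.sub(r"([\s_\-])[\s_\-]+", r"\1", stem)
    stem = stem.strip(" _-")
    return stem or "document"  # fallback stem when everything was stripped
```

With the term list ["Acme"], the stem "1778551079623_Acme Q3 report [External] (1)" reduces to "Q3 report", while "Acmetech_plan" is left alone because the term is not a whole token there.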
Step 2: Detect type
file --mime-type -b <path> returns canonical MIME. Map to a category: ooxml | pdf | image | video | audio | archive | iwork | legacy_office | svg | text | unknown.
Never trust file extension alone. A .xlsx may be a renamed binary; an image may be mislabeled.
Execution: Direct
Verify: non-empty MIME string
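One way the MIME-to-category mapping might look. The table below is an illustrative assumption, not the script's actual map; iWork is left out because the MIME string reported for those bundles varies, so the sketch would report them as archive or unknown.

```python
import subprocess

OOXML_PREFIX = "application/vnd.openxmlformats-officedocument."

# Assumed exact matches; checked before the generic image/video/audio/text prefixes
# so that image/svg+xml lands in "svg" rather than "image".
EXACT = {
    "application/pdf": "pdf",
    "application/zip": "archive",
    "application/x-tar": "archive",
    "image/svg+xml": "svg",
    "application/msword": "legacy_office",
    "application/vnd.ms-excel": "legacy_office",
    "application/vnd.ms-powerpoint": "legacy_office",
}


def detect_mime(path: str) -> str:
    """Ask file(1) for the MIME type; the extension is never consulted."""
    result = subprocess.run(
        ["file", "--mime-type", "-b", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def categorize(mime: str) -> str:
    """Map a MIME string to one of the scrub categories."""
    if mime.startswith(OOXML_PREFIX):
        return "ooxml"
    if mime in EXACT:
        return EXACT[mime]
    for prefix in ("image", "video", "audio", "text"):
        if mime.startswith(prefix + "/"):
            return prefix
    return "unknown"
```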
Step 3: Pre-scan
exiftool -a -G1 -j <input> → JSON snapshot. Used for the diff in Step 6.
Execution: Direct
Produces: in-memory JSON
Verify: parses as JSON
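With -j, exiftool prints a JSON array holding one object per file, and with -G1 the keys take the form Group:Tag. A minimal sketch of how the Step 6 diff could consume two such snapshots follows; removed_by_group is a hypothetical name and the sample tags in the usage are made up.

```python
import json
from collections import defaultdict


def removed_by_group(before_json: str, after_json: str) -> dict[str, list[str]]:
    """Return tags present before the scrub but absent after, keyed by group."""
    before = json.loads(before_json)[0]
    after = json.loads(after_json)[0]
    removed = defaultdict(list)
    for key in before:
        # Ungrouped keys such as SourceFile carry no metadata group; skip them.
        if key not in after and ":" in key:
            group, tag = key.split(":", 1)
            removed[group].append(tag)
    return {group: sorted(tags) for group, tags in sorted(removed.items())}
```

The result maps directly onto a per-group report table: one row per group, listing the tags that disappeared.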
Step 4: Type-specific scrub
Branch on detected category. All branches always write to --out, never modify input.
OOXML (xlsx / docx / pptx):
- Load the workbook/document with the appropriate library (openpyxl / python-docx / python-pptx). For xlsx, open with openpyxl in read_only=True, data_only=True mode so cached values are read instead of formulas.
- Iterate values into a fresh document. This drops embedded images, hyperlinks, formulas, custom styles, defined names, comments, revision history.
- Rename single sheets to Sheet1. Strip core properties: empty creator and lastModifiedBy, fixed neutral created/modified dates.
- Save intermediate. Then re-package the resulting ZIP entry-by-entry with: date_time=(1980,1,1,0,0,0), external_attr = 0o644 << 16, create_system = 0 (MS-DOS, kills Unix uid/gid extras).
- Replace docProps/app.xml with vanilla <Application>Microsoft Excel</Application> + <AppVersion>16.0000</AppVersion> (no Mac fingerprint, no openpyxl signature).
- Replace docProps/core.xml with empty creator/lastModifiedBy + neutral fixed dates.
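The entry-by-entry re-packaging step could be sketched with the standard zipfile module as below. repack_zip is a hypothetical name; the sketch copies every entry unchanged and leaves the docProps replacement out.

```python
import zipfile

FIXED_DATE = (1980, 1, 1, 0, 0, 0)  # earliest timestamp the ZIP format can store


def repack_zip(src, dst) -> None:
    """Rewrite a ZIP so every entry carries neutral header fields.

    src and dst may be paths or binary file objects.
    """
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for old in zin.infolist():
            # A fresh ZipInfo starts with empty extra and comment fields, so
            # Unix uid/gid extras from the original entry are not carried over.
            info = zipfile.ZipInfo(filename=old.filename, date_time=FIXED_DATE)
            info.compress_type = zipfile.ZIP_DEFLATED
            info.create_system = 0            # MS-DOS
            info.external_attr = 0o644 << 16  # fixed permission bits
            zout.writestr(info, zin.read(old.filename))
```

Entry order is preserved because infolist() returns entries in archive order, and the new archive is written without an archive-level comment.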
PDF:
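- qpdf --linearize --object-streams=generate <in> <out> → rebuild structure.
- exiftool -all= -overwrite_original_in_place <out> → strip /Author /Creator /Producer /Title /Subject /Keywords /CreationDate /ModDate plus XMP packets.
- exiftool edits PDFs by incremental update, so the stripped values stay recoverable in the file until it is rewritten. Re-run the qpdf rebuild as the last write to the PDF, after every exiftool pass (including the Step 5 one).
- Verify no embedded files / no JS / no form data with pdftk dump_data if available.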
Image (jpg / png / heic / tiff / webp):
- exiftool -all= -overwrite_original_in_place <out> after copying input to output path.
- Explicit GPS check post-strip: fail if any GPS* field survives.
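The post-strip GPS check can be run against the same exiftool -a -G1 -j output used elsewhere. A minimal sketch, with surviving_gps_tags as a hypothetical helper name:

```python
import json


def surviving_gps_tags(exiftool_json: str) -> list[str]:
    """List every key in an exiftool -G1 -j record that is GPS-related.

    A key counts if its group is GPS or its tag name starts with GPS, which
    also covers copies such as Composite:GPSPosition or XMP-exif:GPSLatitude.
    """
    record = json.loads(exiftool_json)[0]
    hits = []
    for key in record:
        group, _, tag = key.rpartition(":")
        if group.upper() == "GPS" or tag.upper().startswith("GPS"):
            hits.append(key)
    return sorted(hits)
```

An empty list means the image passes; anything else should fail the verify step.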
Video (mp4 / mov / mkv):
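- ffmpeg -i <in> -map_metadata -1 -c copy <out> → strip metadata atoms + location.
- exiftool -all= -overwrite_original_in_place <out> → second pass for ID3-style sidecar tags. exiftool reads but does not write MKV, so for that container the ffmpeg pass is the only strip.
Audio (mp3 / m4a / flac / wav / ogg): exiftool -all= -overwrite_original_in_place after copy. exiftool writes M4A but is read-only for MP3, FLAC, WAV and OGG, so those formats need the same ffmpeg -map_metadata -1 -c copy route as video, with exiftool acting only as the verifier.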
Archive (zip / tar): Walk entries, repack with (1980,1,1,0,0,0) timestamps and zero uid/gid. Strip archive comment.
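For tar, the same idea might look like the sketch below. Tar stores mtime as epoch seconds, so the 1980 timestamp is expressed as 315532800 (1980-01-01 UTC); that mapping, the fixed permission bits, the helper name repack_tar, and the choice to skip links and special files are all assumptions of this sketch.

```python
import tarfile

EPOCH_1980 = 315532800  # 1980-01-01T00:00:00Z in epoch seconds


def repack_tar(src_fileobj, dst_fileobj) -> None:
    """Copy regular files and directories into a new tar with neutral headers."""
    with tarfile.open(fileobj=src_fileobj, mode="r:*") as tin, \
            tarfile.open(fileobj=dst_fileobj, mode="w") as tout:
        for member in tin.getmembers():
            if not (member.isfile() or member.isdir()):
                continue  # this sketch drops symlinks, devices, and similar
            clean = tarfile.TarInfo(name=member.name)
            clean.type = member.type
            clean.size = member.size
            clean.mode = 0o755 if member.isdir() else 0o644
            clean.mtime = EPOCH_1980
            clean.uid = clean.gid = 0        # no numeric owner
            clean.uname = clean.gname = ""   # no owner names
            data = tin.extractfile(member) if member.isfile() else None
            tout.addfile(clean, data)
```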
iWork (numbers / pages / keynote): LibreOffice headless soffice --convert-to xlsx <in> (or pdf for pages/keynote), then re-dispatch on the conversion output.
Legacy Office (xls / doc / ppt): Same — convert to OOXML via soffice --convert-to, then re-dispatch.
SVG / HTML / Markdown: Regex strip — remove <metadata>...</metadata> blocks, sodipodi: and inkscape: namespaces, HTML comments, YAML frontmatter when explicitly requested.
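A sketch of the regex strip for SVG. The patterns are illustrative assumptions rather than the script's own: they handle the common Inkscape output shapes and are not a general XML parser.

```python
import re

NS = r"(?:sodipodi|inkscape)"


def strip_svg_metadata(svg: str) -> str:
    """Remove metadata blocks, comments, and editor-namespace residue."""
    # <metadata>...</metadata> blocks (usually RDF with creator info)
    svg = re.sub(r"<metadata\b.*?</metadata>", "", svg,
                 flags=re.DOTALL | re.IGNORECASE)
    # XML/HTML comments
    svg = re.sub(r"<!--.*?-->", "", svg, flags=re.DOTALL)
    # Self-closing editor elements, e.g. <sodipodi:namedview ... />
    svg = re.sub(r"<" + NS + r":[\w\-]+\b[^>]*?/>", "", svg)
    # Paired editor elements, e.g. <sodipodi:namedview>...</sodipodi:namedview>
    svg = re.sub(r"<(" + NS + r":[\w\-]+)\b[^>]*>.*?</\1>", "", svg,
                 flags=re.DOTALL)
    # Editor attributes on ordinary elements, e.g. inkscape:label="..."
    svg = re.sub(r'\s+' + NS + r':[\w\-]+="[^"]*"', "", svg)
    # The namespace declarations themselves
    svg = re.sub(r'\s+xmlns:' + NS + r'="[^"]*"', "", svg)
    return svg
```

Only double-quoted attribute values are matched; files that use single quotes would need the attribute patterns extended.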
Unknown: Fall through to Step 5 universal pass only. Print a warning naming the detected MIME.
Execution: Direct
Verify: type-specific (see verify step)
Step 5: Universal pass
Always run, regardless of category branch:
- exiftool -all= -overwrite_original_in_place <out>: catches any EXIF/XMP that survived the type-specific scrub.
- xattr -cr <out>: clears macOS extended attributes including kMDItemWhereFroms (download URL!), kMDItemUserTags, quarantine, lastuseddate. Note: com.apple.provenance may reattach automatically; that's macOS-applied, opaque (no PII), and does not survive transport (email / Signal / Drive / web upload all strip xattrs).
- Delete sibling resource fork files: os.remove(os.path.join(dir, '._' + name)) if present.
Execution: Direct
Verify: xattr -l shows only com.apple.provenance or empty
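The three universal steps might be chained as in the sketch below. Function names are hypothetical; the exiftool and xattr calls are run with check=False on the assumption that a non-zero exit (for example, a type exiftool cannot write) should surface in the verify step rather than abort the pass.

```python
import os
import subprocess


def remove_resource_fork(path: str) -> bool:
    """Delete the ._<name> sidecar next to path; return True if one existed."""
    directory, name = os.path.split(os.path.abspath(path))
    sidecar = os.path.join(directory, "._" + name)
    if os.path.isfile(sidecar):
        os.remove(sidecar)
        return True
    return False


def universal_pass(out: str) -> bool:
    """Run the universal pass on the output copy (never on the input)."""
    subprocess.run(
        ["exiftool", "-all=", "-overwrite_original_in_place", out], check=False
    )
    subprocess.run(["xattr", "-cr", out], check=False)
    return remove_resource_fork(out)
```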
Step 6: Verify
Three checks:
- Exiftool diff: re-run exiftool -a -G1 -j <out>, compare to the Step 3 snapshot. Report what was removed by group (System, ZIP, XML, EXIF, XMP-dc, IPTC, GPS, etc.).
- Leak-term grep: combine the --terms arg with ~/Documents/GitHub/claude/.claude/skills/personal/scrub-meta/leak-terms.txt. For OOXML/archives, unzip <out> to a temp dir and grep across metadata files only (docProps/, *.rels, xl/workbook.xml, equivalents for docx/pptx); never grep across cell content / body content, because legitimate data may match terms by coincidence. For other types, grep against the exiftool JSON output. Zero hits required.
- Round-trip integrity: base64 <out> | base64 -D | shasum -a 256 matches shasum -a 256 <out>. Confirms file content is fully self-contained (xattrs are not in the bytes).
Execution: Direct
Verify: all three pass
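The metadata-only leak-term check for OOXML packages can be sketched without unzipping to disk. The glob list is an assumption: the text names docProps/, *.rels and xl/workbook.xml, and word/settings.xml and ppt/presentation.xml are guesses at the docx/pptx equivalents. leak_hits is a hypothetical name.

```python
import fnmatch
import zipfile

# Metadata members only; worksheet, document body and slide parts are never read.
METADATA_GLOBS = (
    "docProps/*",
    "*.rels",
    "xl/workbook.xml",
    "word/settings.xml",      # assumed docx equivalent
    "ppt/presentation.xml",   # assumed pptx equivalent
)


def leak_hits(package, terms: list[str]) -> list[tuple[str, str]]:
    """Return (member, term) pairs for each term found in a metadata member."""
    hits = []
    with zipfile.ZipFile(package) as z:
        for name in z.namelist():
            if not any(fnmatch.fnmatchcase(name, g) for g in METADATA_GLOBS):
                continue
            text = z.read(name).decode("utf-8", errors="replace").lower()
            hits.extend((name, t) for t in terms if t.lower() in text)
    return hits
```

An empty result corresponds to the "zero hits required" condition; a term that appears only in cell or body content is never reported.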
Step 7: Report
Print a markdown table of what was removed, by category. Include the input path, output path, sha256, and any warnings (e.g. macOS com.apple.provenance re-attached, or unknown type fell through to universal pass).
Step 8: Reveal in Finder
open -R <out> reveals and selects the scrubbed file in Finder.
Step 9: Warn
Always print as the final line:
Do not open this file in its native app before sending. Excel / Word / Preview / Keynote re-inject creator, lastModifiedBy, and absolute paths on save. Send as-is from the output location.
Hard Rules
- Never modify the input file. Always copy to --out. Even --in-place writes to a temp + atomic rename.
- Never trust the file extension for type detection. file --mime-type -b is the source of truth.
- Grep leak terms only in metadata files / metadata extracts, never in body / cell / page content. Legitimate content can match terms by coincidence (e.g. an author named "Dmitrievna" matching "Dmitri"). Reporting false positives erodes trust in the scrub.
- Standing leak terms live in the gitignored personal overlay (~/Documents/GitHub/claude/.claude/skills/personal/scrub-meta/leak-terms.txt), never in this skill's tracked files. Names of partners and personal paths must not ship publicly.
- com.apple.provenance is expected to reattach after xattr -cr. Don't loop trying to strip it. It's opaque, contains no PII, and is removed by every network transport. Document the residual in the report.
- Office apps re-inject metadata on save. The user must be warned every time, on the final line of the report.
- GPS in images is non-negotiable: fail the verify step if any GPS* field survives an image scrub.
- LibreOffice / qpdf / ffmpeg are required for some branches. If soffice / qpdf / ffmpeg are not in PATH, fail loudly with the install hint (brew install libreoffice qpdf ffmpeg) instead of silently degrading.
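The fail-loudly rule for external tools could be enforced up front with shutil.which. A minimal sketch; the function names are hypothetical and the which parameter exists only so the lookup can be swapped out in tests.

```python
import shutil

REQUIRED = ("soffice", "qpdf", "ffmpeg")
INSTALL_HINT = "brew install libreoffice qpdf ffmpeg"  # hint as given in the rule above


def missing_tools(tools=REQUIRED, which=shutil.which) -> list[str]:
    """Return the required executables that are not on PATH."""
    return [tool for tool in tools if which(tool) is None]


def require_tools(tools=REQUIRED, which=shutil.which) -> None:
    """Abort with the install hint instead of silently degrading."""
    missing = missing_tools(tools, which)
    if missing:
        raise SystemExit(
            f"missing required tools: {', '.join(missing)}; "
            f"install with: {INSTALL_HINT}"
        )
```

A branch that needs only one tool can pass a narrower tuple, for example require_tools(("qpdf",)) before the PDF scrub.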