name: xlsx description: Use when the user asks to create, transform, validate, chart, preview, or password-protect Microsoft Excel .xlsx workbooks. Triggers include "csv to xlsx", "recalculate this workbook", "scan formula errors", "add a chart to xlsx", "bar / line / pie chart over a range", "financial model in xlsx", "fix #REF errors", "preview xlsx as image", "encrypt / decrypt / password-protect an xlsx", and related spreadsheet generation, recalculation, or OOXML round-trip tasks. tier: 2 version: 1.0 license: LicenseRef-Proprietary
xlsx skill
Purpose: Give the agent a deterministic, script-first path for
creating and sanity-checking .xlsx workbooks. The core operations
(CSV → styled .xlsx, force formula recalculation, scan for formula
errors, structural OOXML validation) are wrapped in small CLIs so the
agent does not have to rewrite openpyxl boilerplate on every task and
is never surprised by "formulas are stored but not calculated" (the
single most common xlsx bug).
1. Red Flags (Anti-Rationalization)
STOP and READ THIS if you are thinking:
- "I'll just call
DataFrame.to_exceland ship it." → WRONG.to_excelwrites no styles, no frozen header, and no auto-filter. The result looks amateur. Usecsv2xlsx.py. - "I'll just call
pd.DataFrame.from_records(rows).to_excel(out)on my LLM JSON output." → WRONG. Sameto_excelstyling gap, plus pandas'infer_objectsheuristics silently promote mixed-type columns toobject/float64(anintcolumn with onenullbecomesfloat64). Usejson2xlsx.py— preserves native JSON types, ISO-date auto-coercion, csv2xlsx-style header. - "I wrote formulas with openpyxl, so the numbers are there." → WRONG.
openpyxlstores formulas as strings with no cached value. Every downstream consumer (pandas, charts, external apps) seesNone. Runxlsx_recalc.pybefore shipping. - "Validation says OK, the formulas must be fine." → WRONG.
xlsx_validate.pyscans for cached error values. If there are no cached values at all (fresh openpyxl output), the scan is meaningless. Use--fail-emptyor recalc first. - "Leading zeros in my phone-number column vanished; it's fine." → WRONG. It is almost never fine. Excel and pandas both coerce
"007"to7by default.csv2xlsx.pydetects leading-zero columns and keeps them as text; inline code should either passdtype=strto pandas or setcell.number_format = "@"in openpyxl. - "I'll just regex-replace
\|to extract markdown tables." → WRONG. GFM pipe parsing has 5+ edge cases (escaped pipes\|, optional trailing pipes, separator-row alignment markers:---:/---:/:---, column-count validation, blockquote skip, fenced-code-block skip) that regex hand-rolls always fail. Usemd_tables2xlsx.py. For HTML<table>blocks: same tool —lxml.htmlis locked withno_network=True, huge_tree=False.
2. Capabilities
- Convert a CSV / TSV to a styled
.xlsxwith bold header, freeze-first-row, auto-filter, auto column widths, and leading-zero preservation. - Convert an
.xlsxworkbook back to CSV (per-sheet / per-region) or JSON (flat / dict-of-arrays / nestedtables) (xlsx2csv.py,xlsx2json.py). Thin shims on top of thexlsx_read/foundation library (xlsx-10.A). Supports--sheet NAME|all,--include-hidden,--header-rows N|auto,--tables whole|listobjects|gap|auto,--merge-policy anchor-only|fill|blank,--include-hyperlinks(JSON: dict-form{"value", "href"}; CSV: markdown-link-as-text[text](url)),--include-formulas,--datetime-format ISO|excel-serial|raw. JSON shapes: single-sheet single-region flat array; multi-sheet dict-of-arrays; multi-sheet multi-region nested{Sheet: {tables: {Name: [...]}}}; single-sheet multi-region flat{Name: [...]}. CSV multi-region requires--output-dir→ subdirectory schema<dir>/<sheet>/<table>.csv(sheet names with__are NOT split — L4 lock). Round-trip withjson2xlsx.py(xlsx-2) for Shape 1 / Shape 2 is byte-identical (live intests/test_json2xlsx.py::TestRoundTripXlsx8::test_live_roundtrip); Shape 3 / Shape 4 are lossy on xlsx-2 v1 consume (full restoration deferred to xlsx-2 v2--write-listobjects). Cross-cutting parity: cross-3 / cross-4 / cross-5 / cross-7 envelopes; path-traversal guard on--output-dir. Public helpersconvert_xlsx_to_csv(input, output=None, **kwargs)andconvert_xlsx_to_json(input, output=None, **kwargs)inxlsx2csv2json/__init__.py. - Convert a JSON / JSONL document (file or stdin
-) to a styled.xlsxwith the SAME visual contract as csv2xlsx (json2xlsx.py). Three input shapes auto-detected: array-of-objects (single sheet), dict-of-arrays-of-objects (multi-sheet), JSONL (one JSON object per line —.jsonlextension). Preserves native JSON types (int / float / bool / null / str); ISO-8601 date strings auto-coerced to Excel datetime cells;--strict-datesmakes timezone-aware datetimes a hard fail. Cross-cutting parity: cross-5--json-errorsenvelope, cross-7 H1 same-path guard (exit 6SelfOverwriteRefused), stdin pipe-. Round-trip contract withxlsx2json.py(xlsx-8) is frozen atskills/xlsx/references/json-shapes.mdand is LIVE as of xlsx-8 merge. - Convert an
.xlsxworkbook back to Markdown (GFM pipe tables, HTML<table>blocks, or per-table hybrid auto-select) (xlsx2md.py). Thin shim on top of thexlsx_read/foundation (xlsx-10.A) — symmetric pair tomd_tables2xlsx.pyclosing thexlsx → md → edits → xlsxround-trip. Supports--sheet NAME|all,--include-hidden,--format gfm|html|hybrid(defaulthybrid),--header-rows N|auto|smart,--memory-mode auto|streaming|full,--hyperlink-scheme-allowlist http,https,mailto(Sec-MED-2 default-enabled),--no-table-autodetect,--no-split,--gap-rows N --gap-cols N,--gfm-merge-policy fail|duplicate|blank,--datetime-format ISO|excel-serial|raw,--include-formulas(HTML only — M7 lock: GFM + formulas → exit 2). Hybrid promotion rules (per-table, first match wins): body merges → HTML; multi-row header → HTML;--include-formulas+ formula cell → HTML;headerRowCount=0(synthetic headers) → HTML (D13). Otherwise GFM. Multi-row<thead>reconstruction from xlsx_read's flat›-joined headers (D-A11). Hyperlinks via Path C′ (parallel pass through openpyxl) preserve display text — emit as[text](url)GFM or<a href="...">text</a>HTML; blocked schemes emit text-only + warning. Round-trip contract frozen atreferences/xlsx-md-shapes.md; LIVE round-trip viaTestRoundTripXlsx9::test_live_roundtrip_xlsx_md(cell-content byte-identical for non-merged plain tables; sheet-name asymmetryHistory → History_is xlsx-3's documented sanitisation, NOT a regression). Cross-cutting parity: cross-3/4/5/7 envelopes; terminalInternalErrorredaction (R23f). Public helperconvert_xlsx_to_md(input, output=None, **kwargs)inxlsx2md/__init__.py. - Extract markdown tables from a
.mddocument (file or stdin-) into a styled multi-sheet.xlsx(md_tables2xlsx.py). Two table flavors auto-detected: GFM pipe tables (with per-column alignment carried to Excel cell alignment) and HTML<table>blocks (withcolspan/rowspanhonoured as Excel merged cells). Sheet names derive from the nearest preceding markdown or HTML heading; fallbackTable-N; UTF-16-aware truncate to Excel's 31-char limit; case-insensitive workbook-wide dedup via-2..-99suffix. Default-on numeric + ISO-date coercion (csv2xlsx + json2xlsx parity; leading-zero preservation; aware datetimes → UTC-naive). Cross-cutting parity:--json-errorscross-5 envelope, cross-7 H1 same-path guard (exit 6), stdin pipe-. Pre-scan strips fenced code blocks, HTML comments, indented code blocks,<style>/<script>blocks, and blockquoted tables — so tables inside those regions never reach the parser. HTML parser islxml.html.HTMLParser(no_network=True, huge_tree=False, recover=True)singleton (defense-in-depth against XXE + libxml2 huge-tree expansion). Public helperconvert_md_tables_to_xlsx(input_path, output_path, **kwargs) -> intlives inmd_tables2xlsx/__init__.py(mirrors xlsx-2convert_json_to_xlsx; VDD-multi atomic-token protection inherited). - Force formula recalculation in an
.xlsxvia headless LibreOffice, then optionally scan for error cells. - Scan an
.xlsxfor formula errors (#REF!,#DIV/0!,#VALUE!,#NAME?,#N/A,#NUM!,#NULL!) without recomputing. - Add a bar / line / pie chart on a value range with optional categories, title, anchor; stays editable in Excel / LibreOffice.
- Insert an Excel comment (legacy
<comment>, optionally with the threaded-comment + personList Excel-365 modern layer) into a target cell, with cross-sheet--cellsyntax and a batch mode that auto-detects the xlsx-7 findings envelope. Closes the "validation-агент расставляет замечания" pipeline together withxlsx_check_rules.py(xlsx-7). - Declarative business-rule validation —
xlsx_check_rules.pyreads a YAML/JSON rules file alongside an.xlsxand emits a machine-readable findings envelope ({ok, summary, findings}) on stdout; pipes directly intoxlsx_add_comment.py --batch -for cell-comment placement. Optional--outputwrites a workbook copy with aRemarkscolumn + per-severity PatternFill. Hardened DSL (closed AST, noeval/exec), ReDoS-lint reject-list, billion-laughs YAML alias rejection, and a 100K-row × 10-rule perf contract (≤ 30 s wall-clock). - Unpack and repack
.xlsxarchives for raw OOXML editing (sharedoffice/module with the docx skill). - Structurally validate an
.xlsx(relationships, content types, required parts, package-layout allow-list — every ZIP entry must live under[Content_Types].xml,_rels/,xl/,docProps/, orcustomXml/per ECMA-376 §11.3.10; catches scratch-file leaks that Excel refuses to open. With--strict, the package-layout warning is promoted to exit 1). - Reject password-protected and legacy
.xls(CFB-container) inputs early in the reader scripts (xlsx_recalc.py,xlsx_validate.py,xlsx_add_chart.py,office/validate.py,office/unpack.py,preview.py) with a clear remediation message (exit 3) instead of aBadZipFiletraceback.csv2xlsx.pyandoffice_passwd.pyare not gated — the former takes CSV/TSV input (no encryption to detect), the latter is the encryption tool itself. - Detect macro-enabled inputs (
.xlsm, withxl/vbaProject.bin) and warn when the chosen output extension would silently drop the macros (xlsm→xlsx). - Render any
.xlsx/.xlsm/.pdf(or peer-skill.docx/.pptx) into a single PNG-grid preview viapreview.py(LibreOffice + Poppler). - Emit failures as machine-readable JSON to stderr with
--json-errors(uniform across all four office skills). - Set or remove a password on a
.xlsx/.docx/.pptx(MS-OFB Agile, Office 2010+) viaoffice_passwd.py— three modes:--encrypt PASSWORD,--decrypt PASSWORD,--check(exit 0 encrypted / 10 clean / 11 missing).
3. Execution Mode
- Mode:
script-first. - Why this mode: Spreadsheet tasks have well-defined inputs and outputs and benefit heavily from deterministic CLIs. Writing the styling code inline produces ugly workbooks every time; delegating to scripts frees the agent to focus on the user's intent.
4. Script Contract
- Commands:
python3 scripts/csv2xlsx.py INPUT.csv OUTPUT.xlsx [--delimiter auto|,|;|\t] [--encoding utf-8] [--no-freeze] [--no-filter]python3 scripts/xlsx2csv.py INPUT.xlsx [OUTPUT.csv|-] [--output-dir DIR] [--sheet NAME|all] [--include-hidden] [--header-rows N|auto] [--merge-policy anchor-only|fill|blank] [--tables whole|listobjects|gap|auto] [--gap-rows N] [--gap-cols N] [--include-hyperlinks] [--include-formulas] [--datetime-format ISO|excel-serial|raw] [--memory-mode auto|streaming|full] [--encoding utf-8|utf-8-sig] [--delimiter ,|;|tab|\\t|pipe] [--drop-empty-rows] [--json-errors]— convert .xlsx → CSV. Multi-region requires--output-dir. Output-or omitted → stdout.--drop-empty-rowsfilters out rows where every value isNone/""(default off; same semantics as the JSON path). For full Excel-RU/EU double-click compatibility pass BOTH--encoding utf-8-sigAND--delimiter ';': the BOM lets Excel detect UTF-8 encoding, and;is the field separator on RU/EU locales (where,is the decimal separator). Defaults (utf-8+,) stay pandas-/jq-friendly. Stdout always plain UTF-8 regardless of flags. Thetabalias (or 2-char\\tescape) maps to a real tab;pipemaps to|. Python-helper equivalent:convert_xlsx_to_csv(input, output, delimiter=';', encoding='utf-8-sig')(kwarg names map to CLI flags 1:1).python3 scripts/xlsx2json.py INPUT.xlsx [OUTPUT.json|-] [--sheet NAME|all] [--include-hidden] [--header-rows N|auto|leaf|smart] [--header-flatten-style string|array] [--merge-policy anchor-only|fill|blank] [--tables whole|listobjects|gap|auto] [--gap-rows N] [--gap-cols N] [--include-hyperlinks] [--include-formulas] [--datetime-format ISO|excel-serial|raw] [--memory-mode auto|streaming|full] [--drop-empty-rows] [--json-errors]— convert .xlsx → JSON. Output-or omitted → stdout.--drop-empty-rowsfilters rows where every value isNone/"".--header-rows leafauto-detects the header band but keeps ONLY the deepest non-empty level per column as the JSON key — solves the layout-heavy-report key bloat where rows 1..K-1 are merged metadata banners (title / customer / contract / project / period) and the real column names sit on row K. Example recipe:xlsx2json.py FILE.xlsx out.json --sheet all --tables auto --header-rows leaf --drop-empty-rows→ keys become["Date", "Hours", "Days", "Specialist", "Position", "Task #", "Description"]instead of 7-level›-concatenated metadata-prefixed strings.--header-rows smart(xlsx-8a-09 / R11; iter-3 2026-05-13) is the "find-the-data-table" recipe — locates the real header row whether or not merges are present. Uses a type-pattern heuristic: scores each top row by string-ratio + column-coverage + downstream type-stability + depth (max 5.0); when a candidate scores ≥ 3.5 ANDlen(sample_below) ≥ 2, the library shifts the region past the metadata and treats the candidate row as a 1-row header.smartdoes NOT defer to merge-based detection — it competes purely on score. On merged-banner fixtures (e.g.A1:C1"2026 plan" overQ1/Q2/Q3),smartshifts to the sub-header row and produces leaf-like keys["Q1", "Q2", "Q3"]. Callers needing the multi-level concatenated form ("2026 plan › Q1") must use--header-rows auto(orleafto take the deepest level only) —smartis non-overlapping withauto/leafat the OUTPUT shape level. Usesmartwhenautoproduces{"": ..., "От": ..., "До": ..., "__2": ..., ..., "__N": ...}-style keys on a workbook whose real headers are buried below a config section (typical: financial-modeling sheets withОт/Доparameter ranges on rows 1-K + a real data table starting at row K+1; or per-row banner + per-row metadata header rows above the column-name row, e.g. masterdata Timesheet pattern).--memory-mode {auto,streaming,full}(xlsx-8a-11 / R13) exposes the read-mode selection:auto(default) lets the library pick based on file size (_DEFAULT_READ_ONLY_THRESHOLD = 100 MiB);streamingminimises RAM — measured 7.6× RAM reduction PLUS 2.6× wall-clock speedup on 15 MB multi-sheet workbooks (1188 MB / 32 s → 156 MB / 13 s, live subprocess measurement 2026-05-13) at the cost of merge-aware features (overlap detection, multi-row header band via merges, merge-policy fill) becoming no-ops — use on workbooks without merges OR where merge fidelity does not matter;fullforces non-streaming so all merge features work but RAM scales ~10× with file size.--include-hyperlinksis incompatible withstreaming(hyperlink extraction needs the non-streaming cell object); the combination emits a stderr warning and auto-overrides tofull. Practical recipe for large multi-sheet workbooks without critical merges:xlsx2json.py BIG.xlsx out.json --sheet all --tables whole --header-rows smart --memory-mode streaming --drop-empty-rows.python3 scripts/md_tables2xlsx.py INPUT.md OUTPUT.xlsx [--no-coerce] [--no-freeze] [--no-filter] [--allow-empty] [--sheet-prefix STR] [--encoding utf-8] [--json-errors]— markdown tables → multi-sheet xlsx.INPUTis a path or-for stdin.python3 scripts/xlsx2md.py INPUT.xlsx [OUTPUT.md|-] [--sheet NAME|all] [--include-hidden] [--format gfm|html|hybrid] [--header-rows N|auto|smart] [--memory-mode auto|streaming|full] [--hyperlink-scheme-allowlist SCHEMES] [--no-table-autodetect] [--no-split] [--gap-rows N] [--gap-cols N] [--gfm-merge-policy fail|duplicate|blank] [--datetime-format ISO|excel-serial|raw] [--include-formulas] [--json-errors]— convert.xlsx→ Markdown. Default--format hybridauto-selects GFM vs HTML per-table by 4 promotion rules (body merges / multi-row header / formulas+include / synthetic header). Output-or omitted → stdout. Hyperlinks ALWAYS extracted (D5;read_only_mode=Falsedefault) — pass--memory-mode=streamingto reduce RAM at the cost of unreliable hyperlinks.--hyperlink-scheme-allowlistdefaults tohttp,https,mailto(Sec-MED-2 default-enabled) —'*'allows all,""blocks all.--format=gfm + --include-formulas→ exit 2IncludeFormulasRequiresHTML(M7 lock);--format=gfm+ body merges +--gfm-merge-policy=fail(default) → exit 2GfmMergesRequirePolicy.--header-rows=smartruns the xlsx-8a-09 type-pattern heuristic to skip metadata banners; on--header-rows=auto(default),xlsx2md.pyauto-falls-back tosmartwhen the initial detection produces non-uniform header depths (D-A25 hot-patch 2026-05-14, surfaces xlsx-8a-09 transparently for the masterdata Timesheet banner-and-metadata pattern; one stderr warning per affected region). Round-trip safe withmd_tables2xlsx.pyfor non-merged plain tables (cell content byte-identical; sheet-name asymmetryHistory → History_is xlsx-3's documented sanitisation, seereferences/xlsx-md-shapes.md§6). Python-helper:convert_xlsx_to_md(input, output, **kwargs)inxlsx2md/__init__.py(--flag=valueatomic-token kwargs; mirrors xlsx-2 / xlsx-3 D7 lock).python3 scripts/xlsx_recalc.py INPUT.xlsx [--output OUT.xlsx] [--timeout 120] [--scan-errors] [--json]python3 scripts/xlsx_validate.py INPUT.xlsx [--json] [--fail-empty]python3 scripts/xlsx_add_chart.py INPUT.xlsx --type bar|line|pie --data RANGE [--categories RANGE] [--title TEXT] [--sheet NAME] [--anchor CELL] [--titles-from-data | --no-titles-from-data] [--output OUT.xlsx]python3 scripts/xlsx_add_comment.py INPUT.xlsx OUTPUT.xlsx (--cell REF --author NAME --text MSG | --batch FILE [--default-author NAME] [--default-threaded]) [--threaded | --no-threaded] [--initials INI] [--date ISO] [--allow-merged-target] [--json-errors]python3 scripts/xlsx_check_rules.py INPUT.xlsx --rules RULES.{json,yaml} [--sheet NAME | --all-sheets] [--visible-only] [--json | --human] [--max-findings N] [--summarize-after N] [--require-data] [--ignore-stale-cache] [--strict-aggregates] [--treat-numeric-as-date COL] [--treat-text-as-date COL] [--timeout SECONDS] [--no-strip-whitespace] [--no-table-autodetect] [--no-merge-info] [--output OUT.xlsx [--remark-column auto|LETTER|HEADER] [--remark-column-mode replace|append|new] [--streaming-output]] [--json-errors]python3 scripts/office/unpack.py INPUT.xlsx OUTDIR/python3 scripts/office/pack.py INDIR/ OUTPUT.xlsxpython3 scripts/office/validate.py INPUT.xlsx [--strict] [--json]python3 scripts/preview.py INPUT OUTPUT.jpg [--cols 3] [--dpi 110] [--gap 12] [--padding 24] [--label-font-size 14] [--soffice-timeout 240] [--pdftoppm-timeout 60]python3 scripts/office_passwd.py INPUT [OUTPUT] (--encrypt PASSWORD | --decrypt PASSWORD | --check)— pass-as PASSWORD to read it from stdin.- All scripts above accept
--json-errorsto emit failures as a single line of JSON on stderr ({v, error, code, type?, details?}). The schema versionvis currently1; argparse usage errors are routed through the same envelope (type:"UsageError").
- Inputs: positional paths; optional flags per command.
- Outputs: a single file at the named output path;
office/unpack.pyproduces a directory tree; validators print a report (or JSON). - Failure semantics: non-zero exit on missing input, invalid encoding, soffice errors, or formula errors (
xlsx_validate.pyreturns 1 when errors are present). Error detail goes to stderr. - Idempotency:
csv2xlsx.pyproduces the same workbook for the same input every time.xlsx_recalc.pyis idempotent on an already-recalculated workbook. Exception:office_passwd.py --encryptis intentionally non-deterministic — Office encryption uses a fresh random salt per run. - Dry-run support: not applicable.
5. Safety Boundaries
- Allowed scope: only the paths named on the command line; never write outside the requested output path.
- Default exclusions: do not fetch data from remote URLs; only read from the provided local file.
- Destructive actions:
xlsx_recalc.pyrewrites its input in place when--outputis omitted — the convention matches how users expect "recalculate this file" to behave, but make it explicit to the user if the file is important. - Optional artifacts:
office/schemas/is optional; validators fall back to structural checks.
6. Validation Evidence
- Local verification:
python3 -m venv .venv && source .venv/bin/activate && pip install -r scripts/requirements.txt— installs openpyxl, pandas, lxml, defusedxml.python3 scripts/csv2xlsx.py examples/fixture.csv /tmp/out.xlsx && python3 scripts/office/validate.py /tmp/out.xlsx— exit 0, validator reportsOK.python3 scripts/xlsx_validate.py /tmp/out.xlsx --fail-empty— exit 0 (no formula errors, values are concrete).python3 scripts/xlsx_recalc.py /tmp/out.xlsx --scan-errors --json | jq .ok— printstrueif LibreOffice is installed.
- Expected evidence:
/tmp/out.xlsx, validator reports, non-emptyxlsx_validate.py --json. - CI signal:
python3 ../../.claude/skills/skill-creator/scripts/validate_skill.py skills/xlsx— exit 0.
7. Instructions
7.1 Pick the script before writing inline code
- Check §10 Quick Reference first. Most common tasks (CSV→xlsx, recalc, error scan) already exist.
- Drop to inline
openpyxl/pandasonly when you need something the scripts don't cover (charts, conditional formatting, pivot sources).
7.2 Setup
- MUST run
bash scripts/install.shonce. It createsscripts/.venv/locally (nothing global), prints a warning for any missing system tool, and is idempotent. - External system tools (checked by
install.sh, installed manually per project plan §3.3 "внешние инструменты — не бандлятся"):- LibreOffice (
soffice) — required byxlsx_recalc.pyto populate cached formula values that openpyxl cannot compute. macOS:brew install --cask libreoffice. Debian:sudo apt install libreoffice --no-install-recommends. Fedora:sudo dnf install libreoffice. Commands that need it fail with a clear error until it's installed.
- LibreOffice (
7.3 Creating .xlsx from CSV
python3 scripts/csv2xlsx.py input.csv output.xlsxcovers 80% of cases.- Use
--delimiter ';'for European exports;--delimiter '\t'for TSV. - Preserve leading zeros automatically — the script detects them and keeps the column as text. If you need to force that behaviour for a specific column, pre-process the CSV or write a custom openpyxl pass with
cell.number_format = "@".
7.4 Producing a workbook with formulas
- Write the workbook with openpyxl:
cell.value = "=SUM(B2:B100)". Remember the formula is just a string. - MUST run
python3 scripts/xlsx_recalc.py file.xlsxbefore shipping. Otherwise consumers that read withdata_only=TrueseeNonefor every formula cell. - After recalc, run
python3 scripts/xlsx_validate.py file.xlsx. Any#REF!/#DIV/0!is a bug to fix, not a cosmetic issue.
7.5 Validating someone else's workbook
- Structural:
python3 scripts/office/validate.py file.xlsx. - Formula errors:
python3 scripts/xlsx_validate.py file.xlsx --fail-empty. The--fail-emptyflag catches the "formulas never recalculated" case.
7.6 Raw XML editing
Use the shared office/ module (same as docx). Typical reason to drop into XML: tweaking a specific cell style, patching a corrupted sharedStrings.xml, or adding a relationship the user's downstream tool expects.
8. Workflows (Optional)
CSV to a styled, ready-to-ship workbook:
- [ ] Inspect the CSV — check delimiter, encoding, leading zeros
- [ ] `python3 scripts/csv2xlsx.py data.csv out.xlsx`
- [ ] `python3 scripts/office/validate.py out.xlsx`
- [ ] Open in Excel/LibreOffice for a spot-check
Build a formula-driven model:
- [ ] Write workbook with openpyxl (cells + formulas)
- [ ] `python3 scripts/xlsx_recalc.py out.xlsx --scan-errors`
- [ ] Surface any error cells to the user
- [ ] `python3 scripts/xlsx_validate.py out.xlsx`
Audit an incoming .xlsx:
- [ ] `python3 scripts/office/validate.py in.xlsx`
- [ ] `python3 scripts/xlsx_validate.py in.xlsx --fail-empty`
- [ ] If --fail-empty fails, run `xlsx_recalc.py` first then revalidate
9. Best Practices & Anti-Patterns
| DO THIS | DO NOT DO THIS |
|---|---|
Use csv2xlsx.py — styled header, frozen row, auto-filter, leading zeros preserved. |
DataFrame.to_excel("out.xlsx") — unstyled, coerces leading zeros, no filter. |
Run xlsx_recalc.py after any openpyxl write that touches formulas. |
Ship the openpyxl output directly — cached values are None. |
| Combine pandas (fast ETL) with openpyxl (rich styling) in that order. | Mix xlsxwriter and openpyxl on the same file; styles diverge. |
Use @-format or dtype=str for code-like columns. |
Let pandas coerce "007" to 7. |
Rationalization Table
| Agent Excuse | Reality / Counter-Argument |
|---|---|
| "The user didn't ask for styling." | A freeze-first-row and bold header are table stakes. Unstyled to_excel output looks amateur. |
| "Formulas will recalculate when the user opens the file." | Yes, but consumers like pandas, charts, and schedulers do not open the file — they read cached values. Always recalc for shared files. |
| "I need a chart, the scripts don't cover it." | Correct — drop to inline openpyxl for charts. The scripts cover the 80%. |
| "Just loading in pandas is fine — it handles everything." | pandas reads cached values (data_only semantics), so stale caches become NaN. Recalc first. |
10. Quick Reference
| Task | Command |
|---|---|
| CSV → styled .xlsx | python3 scripts/csv2xlsx.py data.csv out.xlsx |
| .xlsx → CSV | python3 scripts/xlsx2csv.py file.xlsx out.csv [--sheet NAME] [--tables listobjects --output-dir DIR] |
| .xlsx → JSON | python3 scripts/xlsx2json.py file.xlsx out.json [--sheet all] [--header-rows auto|leaf|smart] [--tables auto] |
| .xlsx → Markdown | python3 scripts/xlsx2md.py file.xlsx out.md [--sheet all] [--format gfm|html|hybrid] [--header-rows N|auto] |
| Force formula recalc | python3 scripts/xlsx_recalc.py file.xlsx [--scan-errors] |
Scan for #REF!/#DIV/0!/... |
python3 scripts/xlsx_validate.py file.xlsx --fail-empty |
| Add bar/line/pie chart | python3 scripts/xlsx_add_chart.py file.xlsx --type bar --data B2:B10 [--categories A2:A10] [--title "..."] |
| Insert single comment | python3 scripts/xlsx_add_comment.py file.xlsx out.xlsx --cell A5 --author "..." --text "..." [--threaded] |
| Batch comments from xlsx-7 findings | python3 scripts/xlsx_add_comment.py file.xlsx out.xlsx --batch findings.json --default-author "..." |
| Validate against declarative rules | python3 scripts/xlsx_check_rules.py file.xlsx --rules rules.json --json |
| Pipe findings into batch comments | python3 scripts/xlsx_check_rules.py file.xlsx --rules rules.json --json | python3 scripts/xlsx_add_comment.py file.xlsx annotated.xlsx --batch - --default-author "Reviewer" |
| Workbook copy with Remarks column | python3 scripts/xlsx_check_rules.py file.xlsx --rules rules.json --output reviewed.xlsx --remark-column auto |
| Unpack for XML editing | python3 scripts/office/unpack.py file.xlsx unpacked/ |
| Repack | python3 scripts/office/pack.py unpacked/ file.xlsx |
| Structural validation (deep) | python3 scripts/office/validate.py file.xlsx [--json] [--strict] |
| Preview as PNG-grid | python3 scripts/preview.py file.xlsx preview.jpg [--cols 3] [--dpi 110] |
| Set password | python3 scripts/office_passwd.py clean.xlsx encrypted.xlsx --encrypt PASSWORD (use - to read from stdin) |
| Remove password | python3 scripts/office_passwd.py encrypted.xlsx clean.xlsx --decrypt PASSWORD |
| Detect password | python3 scripts/office_passwd.py file.xlsx --check (exit 0 encrypted / 10 clean / 11 missing) |
| Machine-readable failures | append --json-errors to any of the above |
11. Examples (Few-Shot)
Fixture: examples/fixture.csv.
Input — user request:
Convert
sales.csvinto a presentable.xlsxand make sure the numbers are usable as a pivot source.
Output — agent action:
python3 scripts/csv2xlsx.py sales.csv sales.xlsx
python3 scripts/office/validate.py sales.xlsx
python3 scripts/xlsx_validate.py sales.xlsx --fail-empty
The output has no formulas, so xlsx_validate.py --fail-empty is the
right check; all cells carry concrete values.
Input — user request:
The forecast workbook shows
#REF!in a bunch of cells. Find them all.
Output — agent action:
python3 scripts/xlsx_recalc.py forecast.xlsx # ensure cache is fresh
python3 scripts/xlsx_validate.py forecast.xlsx --json
Paste the JSON or the human report back to the user, grouped by error type and sheet.
Input — user request:
Run the validator on
timesheet.xlsxand drop a comment on every problematic cell so my team can triage in Excel.
Output — agent action (xlsx-7 → xlsx-6 pipeline; piped envelope):
# xlsx_check_rules.py emits a JSON envelope of findings; xlsx_add_comment.py
# auto-detects the {ok, summary, findings} shape and writes one comment
# per finding (skipping group-findings with row=null).
python3 scripts/xlsx_check_rules.py timesheet.xlsx --rules rules.json --json \
| python3 scripts/xlsx_add_comment.py timesheet.xlsx timesheet-annotated.xlsx \
--batch - --default-author "Validator Bot"
The --default-author flag is required for envelope shape (DEP-2);
without it the script exits with MissingDefaultAuthor. To attach the
comments as Excel-365 threaded comments instead of legacy bubbles, add
--default-threaded.
12. Resources
- references/financial-modeling-conventions.md — colour coding, number formats, formula hygiene, drivers layout.
- references/openpyxl-vs-pandas.md — which library for which job, read-mode pitfalls.
- references/formula-recalc-gotchas.md — why cached values matter, engines that recalc, engines that don't.
- scripts/csv2xlsx.py — CSV/TSV → styled workbook.
- scripts/xlsx_recalc.py — LibreOffice-backed formula recalculation + error scan.
- scripts/xlsx_validate.py — fast formula-error scan without recalc.
- scripts/xlsx_add_chart.py — bar / line / pie chart attachment over a cell range; chart stays editable in Excel / LibreOffice.
- scripts/xlsx_add_comment.py — insert an Excel comment (legacy + optional Excel-365 threaded) into a target cell; single-cell mode (
--cell) or batch mode (--batch, auto-detects xlsx-7 findings envelope). - references/comments-and-threads.md — OOXML data model behind
xlsx_add_comment.py: part graph, cell-syntax forms, the C1/M-1/M6 pitfalls list (read these before editing the scanner code), v1 honest-scope. - scripts/xlsx_check_rules.py — declarative business-rule validator. Reads YAML/JSON rules, emits
{ok, summary, findings}envelope; pipes intoxlsx_add_comment.py --batch -. Closed AST + ReDoS lint + billion-laughs alias rejection. Backed by scripts/xlsx_check_rules/ package. - references/xlsx-rules-format.md — full SPEC for
xlsx_check_rules.py: rule shape, scope vocabulary, check vocabulary, AST safety, output envelope, exit codes, honest-scope catalogue, regression battery anchors. - examples/check-rules-timesheet.json + examples/check-rules-timesheet.xlsx — worked SPEC §10 example, ready to run end-to-end (validate → pipe → annotate).
- scripts/preview.py — universal
INPUT → PNG-gridrenderer for.xlsx/.xlsm/.docx/.pptx/.pdf. Byte-identical across all four office skills. - scripts/office_passwd.py — set / remove / detect password protection on
.xlsx/.docx/.pptxvia msoffcrypto-tool (MS-OFB Agile, Office 2010+). Byte-identical across the three OOXML skills (not pdf — pdf has its own AcroForm encryption). Pass-as the password to read it from stdin. - scripts/_errors.py —
--json-errorsenvelope helper (schemav=1). - scripts/_soffice.py — LibreOffice subprocess wrapper.
- scripts/office/ — OOXML unpack/pack/validate, byte-identical copy from the docx skill (master — see CLAUDE.md §2). Includes deep
XlsxValidator(sheet chain, sst+styles index bounds, sheet-name uniqueness, orphan parts). - references/security.md — Security model & trust boundary (xlsx-8 / xlsx-8a). Canonical statement of trust-boundary assumptions, accepted-risk catalogue, and the parent-symlink + TOCTOU race in
_emit_multi_region(deferred code-fix, documented as known-limitation). Read before deploying in shared CI or multi-tenant build farms.
13. Known Limitations & Deferred Refactors
xlsx_read/↔xlsx_check_rules/reader duplication (xlsx-10.A, 2026-05-12). The xlsx-10.A foundation library (scripts/xlsx_read/) re-implements a portion of the reader logic also present in xlsx-7'sxlsx_check_rules/package (merge resolution, header detection, value extraction). This is intentional v1 scope to unblock xlsx-8 and xlsx-9 without re-validating xlsx-7's 311-unit / 113-E2E regression battery. The follow-up refactor (xlsx-7 consumesxlsx_read/) is tracked as xlsx-10.B in docs/office-skills-backlog.md. The xlsx-9 owner opens xlsx-10.B within 14 calendar days post-xlsx-9 merge; if missed, the duplication is promoted to permanent technical debt and this bullet is updated accordingly.- xlsx2md (v1): comments / charts / images / shapes /
pivots / data-validation / cell styles / diagonal borders /
sparklines / camera objects dropped; rich-text spans →
plain-text concat; formulas without cached value emit empty
(or
data-formulaattr in HTML when--include-formulas); hyperlinks always extracted (D5;read_only_mode=Falsedefault;--memory-mode=streamingoverride available); hyperlink scheme allowlist defaults{http, https, mailto}(Sec-MED-2 default-enabled in v1); sheet names emitted verbatim fromxl/workbook.xml/<sheets>(xlsx-3 round-trip may sanitise on write-back — documented inreferences/xlsx-md-shapes.md§6, NOT a regression). See docs/tasks/task-012-*.md for details.