docx-reader

name: docx-reader
description: "Comprehensive Word document (.docx) reading skill that extracts full content including text, mathematical formulas (OMML→LaTeX), and embedded images. Use when: needing to READ, ANALYZE, or REVIEW a Word document's complete content including formulas and figures. Triggers include: user provides a .docx file for review, asks to 'read', 'check', 'review', or 'analyze' a Word document, or when formula/image extraction from .docx is needed. This skill is for READING documents; for CREATING or EDITING .docx files, use the 'docx' skill instead."

Comprehensive Word Document Reader

Overview

Read .docx files with as little information loss as possible using a four-step pipeline:

1. Pandoc → Markdown with LaTeX math
2. XML parser → OMML formulas converted to LaTeX (a fallback when pandoc is unavailable, and the more precise source for formulas)
3. Image extraction → all embedded images from `word/media/`
4. Page rendering → PDF → per-page JPEG (for visual inspection)
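The image extraction step relies on the fact that a .docx file is a ZIP archive whose embedded images sit under `word/media/`. A minimal sketch of that step, using only the standard library, might look like the following. The function name `extract_media` is hypothetical and is not taken from `scripts/read_docx.py`, whose internals are not shown here.

```python
import zipfile
from pathlib import Path


def extract_media(docx_path, out_dir):
    """Copy every file under word/media/ in the .docx archive into out_dir.

    Hypothetical helper for illustration; the real script may differ.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as zf:
        for name in zf.namelist():
            # Embedded images live under word/media/ inside the package.
            if name.startswith("word/media/") and not name.endswith("/"):
                target = out / Path(name).name
                target.write_bytes(zf.read(name))
                saved.append(target.name)
    return sorted(saved)
```

Calling `extract_media("report.docx", "out/images")` returns the sorted list of file names it wrote, which makes it easy to see whether any EMF or WMF files are among them.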

Quick Start

Run the extraction script:

python scripts/read_docx.py <input.docx> <output_dir>

This generates:

| File | Content |
| --- | --- |
| `content.md` | Full text with inline $LaTeX$ formulas (from XML parser) |
| `content_pandoc.md` | Pandoc-converted markdown (if pandoc available) |
| `images/` | All embedded images (PNG, JPEG, EMF, etc.) |
| `pages/` | Per-page rendered JPEG images (if LibreOffice available) |
| `metadata.json` | Summary of extraction results |

Reading Workflow

After running the script, follow this order to read the document:

1. Read `content.md` for text and formulas

view_file <output_dir>/content.md

This contains the full document text with formulas converted to LaTeX. Use this as the primary source for understanding the document's content and mathematical expressions.

2. Compare with `content_pandoc.md` if available

Pandoc may capture some formatting (headings, lists, tables) more accurately. Cross-reference when the XML-parsed version seems incomplete.
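One quick way to decide whether a cross-check is needed is to compare how many math spans each file contains. The sketch below assumes both files delimit math with `$...$` and `$$...$$`, which is how the output table describes `content.md` and how pandoc's markdown writer normally emits math. The helper names are hypothetical.

```python
import re

# Display math first, then inline math; escaped dollars (\$) are skipped.
MATH_SPAN = re.compile(r"\$\$(.+?)\$\$|(?<!\\)\$(.+?)(?<!\\)\$", re.DOTALL)


def math_spans(markdown_text):
    """Return the body of every $...$ or $$...$$ span, whitespace-normalized."""
    spans = []
    for m in MATH_SPAN.finditer(markdown_text):
        body = m.group(1) if m.group(1) is not None else m.group(2)
        spans.append(" ".join(body.split()))
    return spans


def compare_formula_counts(xml_md, pandoc_md):
    """Summarize how many math spans each extraction found."""
    a, b = math_spans(xml_md), math_spans(pandoc_md)
    return {"xml_parser": len(a), "pandoc": len(b), "match": len(a) == len(b)}
```

A mismatch in the counts does not say which file is right. It only signals that some formulas deserve a closer look against the page renders.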

3. View embedded images

view_file <output_dir>/images/image1.png

Check each image to understand figures, charts, and diagrams.

4. View page renders for layout verification

view_file <output_dir>/pages/page-001.jpg

Use page images to verify overall layout, formula rendering, and figure placement as they appear in the final document.
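If the `pages/` folder is missing, the renders can be produced by hand. The sketch below builds the two commands involved, assuming LibreOffice is on the PATH as `soffice` (some installs expose it as `libreoffice`) and that Poppler's `pdftoppm` is installed. The function names and the `page` prefix are illustrative, and pdftoppm's own numbering of output files may differ from the `page-001.jpg` naming the script uses.

```python
import shutil
import subprocess
from pathlib import Path


def render_commands(docx_path, out_dir, dpi=150):
    """Build the DOCX -> PDF and PDF -> JPEG commands without running them."""
    docx = Path(docx_path)
    pages = Path(out_dir) / "pages"
    pdf = pages / (docx.stem + ".pdf")
    to_pdf = ["soffice", "--headless", "--convert-to", "pdf",
              "--outdir", str(pages), str(docx)]
    # pdftoppm writes one JPEG per page, named from the given prefix.
    to_jpeg = ["pdftoppm", "-jpeg", "-r", str(dpi), str(pdf), str(pages / "page")]
    return to_pdf, to_jpeg


def render_pages(docx_path, out_dir, dpi=150):
    """Run both commands; return False if either tool is missing."""
    if not (shutil.which("soffice") and shutil.which("pdftoppm")):
        return False
    (Path(out_dir) / "pages").mkdir(parents=True, exist_ok=True)
    for cmd in render_commands(docx_path, out_dir, dpi):
        subprocess.run(cmd, check=True)
    return True
```

Keeping command construction separate from execution makes it possible to print the commands and run them manually when something fails.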

Dependencies

| Tool | Required | Purpose | Install |
| --- | --- | --- | --- |
| Python 3 | ✅ Yes | Script execution | Pre-installed |
| pandoc | ✅ Yes | Text + math extraction | `winget install JohnMacFarlane.Pandoc` (Windows) |
| python-docx | Optional | Advanced text extraction | `pip install python-docx` |
| Pillow | Optional | Image processing | `pip install Pillow` |
| LibreOffice | Optional | DOCX → PDF conversion for page renders | Manual install |
| pdftoppm | Optional | PDF → page images | Comes with Poppler |
| PyMuPDF | Optional | PDF → page images (alternative to pdftoppm) | `pip install PyMuPDF` |

Minimum setup: Python plus pandoc is enough for roughly 90% of documents.
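Before running the script, it can help to see which of the tools above are actually present. The following is a minimal sketch, not part of the skill itself. It assumes LibreOffice is exposed as `soffice`, and it uses the usual import names `docx`, `PIL` and `fitz` for python-docx, Pillow and PyMuPDF.

```python
import importlib.util
import shutil

# Executable names are assumptions; adjust if your install uses others.
CLI_TOOLS = ["pandoc", "soffice", "pdftoppm"]
# Maps the pip package name to the module name it installs.
PY_MODULES = {"python-docx": "docx", "Pillow": "PIL", "PyMuPDF": "fitz"}


def check_dependencies():
    """Return {name: True/False} for each tool and Python package."""
    status = {tool: shutil.which(tool) is not None for tool in CLI_TOOLS}
    for package, module in PY_MODULES.items():
        status[package] = importlib.util.find_spec(module) is not None
    return status


if __name__ == "__main__":
    for name, available in check_dependencies().items():
        print(f"{name:12s} {'found' if available else 'missing'}")
```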

Formula Handling Details

The XML parser converts Office Math Markup Language (OMML) to LaTeX:

| OMML Element | LaTeX Output |
| --- | --- |
| `<m:f>` (fraction) | `\frac{num}{den}` |
| `<m:rad>` (radical) | `\sqrt{x}` or `\sqrt[n]{x}` |
| `<m:sSup>` (superscript) | `x^{n}` |
| `<m:sSub>` (subscript) | `x_{i}` |
| `<m:nary>` (n-ary) | `\sum`, `\int`, `\prod` |
| `<m:d>` (delimiter) | `\left( ... \right)` |
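To make the mapping concrete, here is a small recursive converter covering four of the elements in the table (fractions, radicals, superscripts and subscripts). It is a sketch of the general technique rather than the parser shipped in `scripts/read_docx.py`. It omits `<m:nary>` and `<m:d>`, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URI for Office Math Markup Language elements.
M = "http://schemas.openxmlformats.org/officeDocument/2006/math"


def _tag(name):
    return "{%s}%s" % (M, name)


def _child(node, name):
    """Convert the named child element, or return '' if it is absent."""
    child = node.find(_tag(name))
    return omml_to_latex(child) if child is not None else ""


def omml_to_latex(node):
    tag = node.tag
    if tag == _tag("t"):  # literal text inside a math run
        return node.text or ""
    if tag == _tag("f"):
        return r"\frac{%s}{%s}" % (_child(node, "num"), _child(node, "den"))
    if tag == _tag("sSup"):
        return "%s^{%s}" % (_child(node, "e"), _child(node, "sup"))
    if tag == _tag("sSub"):
        return "%s_{%s}" % (_child(node, "e"), _child(node, "sub"))
    if tag == _tag("rad"):
        degree, body = _child(node, "deg"), _child(node, "e")
        return r"\sqrt[%s]{%s}" % (degree, body) if degree else r"\sqrt{%s}" % body
    if tag.endswith("Pr"):  # property elements carry formatting, not content
        return ""
    # Any other element (m:oMath, m:r, m:e, ...): concatenate its children.
    return "".join(omml_to_latex(child) for child in node)
```

For example, parsing an `<m:f>` whose numerator is `a` and whose denominator is an `<m:sSup>` of `x` and `2` with `ET.fromstring` and passing the root to `omml_to_latex` yields `\frac{a}{x^{2}}`.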

Limitation: MathType equations are stored as OLE objects with WMF/EMF preview images rather than as OMML, so they cannot be converted to LaTeX. Their preview images appear in the `images/` folder instead.
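To find out in advance whether a document is affected, one option is to count embedded equation objects in `word/document.xml`. The sketch below assumes that OLE equations are identified by a `ProgID` attribute starting with `Equation.` (for example `Equation.DSMT4` for MathType). That convention is an assumption about how such objects are usually written, not something stated in this skill, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
import zipfile

# Embedded OLE objects are described by o:OLEObject elements.
OLE_TAG = "{urn:schemas-microsoft-com:office:office}OLEObject"


def count_ole_equations(docx_path):
    """Return {ProgID: count} for embedded objects that look like equations."""
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    counts = {}
    for obj in root.iter(OLE_TAG):
        prog_id = obj.get("ProgID", "")
        # Assumption: equation editors register ProgIDs beginning "Equation."
        if prog_id.startswith("Equation."):
            counts[prog_id] = counts.get(prog_id, 0) + 1
    return counts
```

A non-empty result suggests that some formulas will only be available as images and should be read from `images/` or the page renders.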

Troubleshooting

Garbled formulas: Some complex nested OMML may not convert perfectly. Cross-check with content_pandoc.md or page renders.
Missing images: EMF/WMF formats may not render in all viewers. Convert to PNG if needed.
No page renders: Install LibreOffice + Poppler for full page-image support.