name: docx-reader
description: "Comprehensive Word document (.docx) reading skill that extracts full content including text, mathematical formulas (OMML→LaTeX), and embedded images. Use when: needing to READ, ANALYZE, or REVIEW a Word document's complete content including formulas and figures. Triggers include: user provides a .docx file for review, asks to 'read', 'check', 'review', or 'analyze' a Word document, or when formula/image extraction from .docx is needed. This skill is for READING documents; for CREATING or EDITING .docx files, use the 'docx' skill instead."
Comprehensive Word Document Reader
Overview
Read .docx files with as little information loss as possible using a 4-step pipeline:
- Pandoc → Markdown with LaTeX math
- XML Parser → OMML formulas → LaTeX (fallback + precision)
- Image Extraction → All embedded images from word/media/
- Page Rendering → PDF → per-page JPEG (visual inspection)
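The image-extraction step can be sketched with the standard library alone, because a .docx file is a ZIP archive. This is a minimal illustration, not necessarily how scripts/read_docx.py does it; the function name `extract_media` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_media(docx_path, out_dir):
    """Copy every file under word/media/ in the .docx archive into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Skip directory entries and anything outside word/media/.
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            target = out / Path(name).name
            target.write_bytes(archive.read(name))
            saved.append(target)
    return saved
```

Calling `extract_media("report.docx", "out/images")` would return the list of written paths. The sketch keeps the original file names and does no format conversion.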
Quick Start
Run the extraction script:
python scripts/read_docx.py <input.docx> <output_dir>
This generates:
| File | Content |
|---|---|
| content.md | Full text with inline $LaTeX$ formulas (from XML parser) |
| content_pandoc.md | Pandoc-converted markdown (if pandoc available) |
| images/ | All embedded images (PNG, JPEG, EMF, etc.) |
| pages/ | Per-page rendered JPEG images (if LibreOffice available) |
| metadata.json | Summary of extraction results |
Reading Workflow
After running the script, follow this order to read the document:
1. Read content.md for text and formulas
view_file <output_dir>/content.md
This contains the full document text with formulas converted to LaTeX. Use this as the primary source for understanding the document's content and mathematical expressions.
2. Compare with content_pandoc.md if available
Pandoc may capture some formatting (headings, lists, tables) more accurately. Cross-reference when the XML-parsed version seems incomplete.
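One way to cross-reference the two files is to list the lines that appear only in the pandoc output. The sketch below uses `difflib` from the standard library; the helper name `missing_from_primary` is hypothetical and not part of the skill's script.

```python
import difflib


def missing_from_primary(primary_text, pandoc_text):
    """Return non-empty lines found in the pandoc output but not in content.md."""
    primary = [line.strip() for line in primary_text.splitlines() if line.strip()]
    pandoc = [line.strip() for line in pandoc_text.splitlines() if line.strip()]
    matcher = difflib.SequenceMatcher(a=primary, b=pandoc, autojunk=False)
    missing = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" mark pandoc lines with no exact match in content.md.
        if tag in ("insert", "replace"):
            missing.extend(pandoc[j1:j2])
    return missing
```

Lines that differ only in formatting (for example a heading marker) are reported too, so the result is a list of places to look rather than a list of confirmed omissions.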
3. View embedded images
view_file <output_dir>/images/image1.png
Check each image to understand figures, charts, and diagrams.
4. View page renders for layout verification
view_file <output_dir>/pages/page-001.jpg
Use page images to verify overall layout, formula rendering, and figure placement as they appear in the final document.
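If the pages/ folder is missing, the page renders can be reproduced by hand. The sketch below assumes `soffice` (LibreOffice) and `pdftoppm` (Poppler) are on the PATH; the function names and the 100 dpi default are illustrative choices, and the script's actual settings may differ.

```python
import subprocess
from pathlib import Path


def build_render_commands(docx_path, out_dir, dpi=100):
    """Return the two command lines: .docx -> PDF, then PDF -> per-page JPEG."""
    out = Path(out_dir)
    pdf_path = out / (Path(docx_path).stem + ".pdf")
    to_pdf = ["soffice", "--headless", "--convert-to", "pdf",
              "--outdir", str(out), str(docx_path)]
    # pdftoppm writes <prefix>-<page number>.jpg for each page.
    to_jpeg = ["pdftoppm", "-jpeg", "-r", str(dpi), str(pdf_path), str(out / "page")]
    return to_pdf, to_jpeg


def render_pages(docx_path, out_dir, dpi=100):
    """Run both commands; raises CalledProcessError if either tool fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for command in build_render_commands(docx_path, out_dir, dpi):
        subprocess.run(command, check=True)
    return sorted(Path(out_dir).glob("page-*.jpg"))
```

The file names pdftoppm produces depend on the page count, so they may not match the page-001.jpg pattern shown above exactly.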
Dependencies
| Tool | Required | Purpose | Install |
|---|---|---|---|
| Python 3 | ✅ Yes | Script execution | Pre-installed |
| pandoc | ✅ Yes | Text + math extraction | winget install JohnMacFarlane.Pandoc |
| python-docx | Optional | Advanced text extraction | pip install python-docx |
| Pillow | Optional | Image processing | pip install Pillow |
| LibreOffice | Optional | PDF conversion for page renders | Manual install |
| pdftoppm | Optional | PDF → page images | Comes with Poppler |
| PyMuPDF | Optional | PDF → page images (alt) | pip install PyMuPDF |
Minimum setup: Python + pandoc handles ~90% of cases.
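To see which of the tools in the table are available before running the script, a check along these lines may help. It is a sketch under the assumption that LibreOffice exposes a `soffice` executable on the PATH; `check_dependencies` is a hypothetical helper.

```python
import importlib.util
import shutil


def check_dependencies():
    """Report which external tools and Python packages are available."""
    tools = ["pandoc", "soffice", "pdftoppm"]
    # Import names differ from pip names: python-docx -> docx, Pillow -> PIL, PyMuPDF -> fitz.
    modules = {"python-docx": "docx", "Pillow": "PIL", "PyMuPDF": "fitz"}
    status = {tool: shutil.which(tool) is not None for tool in tools}
    for package, module in modules.items():
        status[package] = importlib.util.find_spec(module) is not None
    return status


if __name__ == "__main__":
    for name, available in check_dependencies().items():
        print(f"{name}: {'found' if available else 'missing'}")
```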
Formula Handling Details
The XML parser converts Office Math Markup Language (OMML) to LaTeX:
| OMML Element | LaTeX Output |
|---|---|
| <m:f> (fraction) | \frac{num}{den} |
| <m:rad> (radical) | \sqrt{x} or \sqrt[n]{x} |
| <m:sSup> (superscript) | x^{n} |
| <m:sSub> (subscript) | x_{i} |
| <m:nary> (n-ary) | \sum, \int, \prod |
| <m:d> (delimiter) | \left( ... \right) |
Limitation: MathType equations are stored as OLE objects with WMF/EMF preview images and cannot be parsed to LaTeX. Their preview images will appear in the images/ folder instead.
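The mapping in the table can be illustrated with a small recursive converter. This is a simplified sketch, not the parser used by scripts/read_docx.py: it covers only runs, `<m:f>`, `<m:rad>`, `<m:sSup>`, and `<m:sSub>`, and the names `to_latex` and `_part` are hypothetical.

```python
import xml.etree.ElementTree as ET

M = "{http://schemas.openxmlformats.org/officeDocument/2006/math}"


def _part(node, name):
    """Convert the named child (e.g. m:num) of node, or return '' if absent."""
    found = node.find(M + name)
    return to_latex(found) if found is not None else ""


def to_latex(node):
    """Convert a small subset of OMML to LaTeX."""
    tag = node.tag.replace(M, "")
    if tag == "t":
        return node.text or ""
    if tag.endswith("Pr"):
        return ""  # property elements carry formatting, not content
    if tag == "f":
        return r"\frac{%s}{%s}" % (_part(node, "num"), _part(node, "den"))
    if tag == "rad":
        degree = _part(node, "deg")
        body = _part(node, "e")
        return r"\sqrt[%s]{%s}" % (degree, body) if degree else r"\sqrt{%s}" % body
    if tag == "sSup":
        return "%s^{%s}" % (_part(node, "e"), _part(node, "sup"))
    if tag == "sSub":
        return "%s_{%s}" % (_part(node, "e"), _part(node, "sub"))
    # Containers such as m:oMath, m:r, m:num, m:e: concatenate their children.
    return "".join(to_latex(child) for child in node)


SAMPLE = (
    '<m:oMath xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math">'
    "<m:f><m:num><m:sSup><m:e><m:r><m:t>x</m:t></m:r></m:e>"
    "<m:sup><m:r><m:t>2</m:t></m:r></m:sup></m:sSup></m:num>"
    '<m:den><m:rad><m:radPr><m:degHide m:val="1"/></m:radPr><m:deg/>'
    "<m:e><m:r><m:t>y</m:t></m:r></m:e></m:rad></m:den></m:f></m:oMath>"
)

print(to_latex(ET.fromstring(SAMPLE)))  # \frac{x^{2}}{\sqrt{y}}
```

Elements outside this subset, including `<m:nary>` and `<m:d>`, fall through to plain concatenation here, so their operators and delimiters would be dropped. A full converter needs a case for each of them.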
Troubleshooting
- Garbled formulas: Some complex nested OMML may not convert perfectly. Cross-check with content_pandoc.md or page renders.
- Missing images: EMF/WMF formats may not render in all viewers. Convert to PNG if needed.
- No page renders: Install LibreOffice + Poppler for full page-image support.
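When images seem to be missing, it can help to first check what the document actually embeds. The following sketch counts files under word/media/ by extension and lists the EMF/WMF ones that may need conversion; `summarize_media` is a hypothetical helper that reads the archive directly rather than the script's output.

```python
import zipfile
from collections import Counter
from pathlib import Path


def summarize_media(docx_path):
    """Count embedded media files by extension and list the EMF/WMF ones."""
    with zipfile.ZipFile(docx_path) as archive:
        media = [name for name in archive.namelist()
                 if name.startswith("word/media/") and not name.endswith("/")]
    counts = Counter(Path(name).suffix.lower() for name in media)
    vector = [name for name in media if Path(name).suffix.lower() in {".emf", ".wmf"}]
    return counts, vector
```

A non-empty second return value points to images that some viewers cannot display and that are candidates for conversion to PNG.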