docx-reader


Comprehensive Word document (.docx) reading skill that extracts full content including text, mathematical formulas (OMML→LaTeX), and embedded images. Use when: needing to READ, ANALYZE, or REVIEW a Word document's complete content including formulas and figures. Triggers include: user provides a .docx file for review, asks to 'read', 'check', 'review', or 'analyze' a Word document, or when formula/image extraction from .docx is needed. This skill is for READING documents; for CREATING or EDITING .docx files, use the 'docx' skill instead.

By keypeo. Updated 3/17/2026


Comprehensive Word Document Reader

Overview

Read .docx files with minimal information loss using a 4-step pipeline:

  1. Pandoc → Markdown with LaTeX math
  2. XML Parser → OMML formulas → LaTeX (fallback + precision)
  3. Image Extraction → All embedded images from word/media/
  4. Page Rendering → PDF → per-page JPEG (visual inspection)
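Step 3 relies on the fact that a .docx file is a ZIP archive whose embedded images live under word/media/. The following is a minimal sketch of that step, not the code in scripts/read_docx.py; the function name `extract_media` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_media(docx_path: str, out_dir: str) -> list[str]:
    """Copy every file under word/media/ in the .docx archive into out_dir.

    Hypothetical helper for illustration; returns the saved file names, sorted.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as zf:
        for name in zf.namelist():
            # Skip directory entries and anything outside word/media/.
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            target = out / Path(name).name  # flatten: keep only the file name
            target.write_bytes(zf.read(name))
            saved.append(target.name)
    return sorted(saved)
```

Files are copied byte for byte, so EMF and WMF images come out in their original format and may still need conversion before viewing.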

Quick Start

Run the extraction script:

python scripts/read_docx.py <input.docx> <output_dir>

This generates:

| File | Content |
|------|---------|
| content.md | Full text with inline $LaTeX$ formulas (from XML parser) |
| content_pandoc.md | Pandoc-converted markdown (if pandoc available) |
| images/ | All embedded images (PNG, JPEG, EMF, etc.) |
| pages/ | Per-page rendered JPEG images (if LibreOffice available) |
| metadata.json | Summary of extraction results |
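Before reading, it can help to confirm which of these outputs were actually produced. The sketch below uses only the names listed above. The optional flags mirror the "if available" notes, and treating images/ as always present is an assumption (a document without figures may not produce it). The function name `check_outputs` is hypothetical.

```python
from pathlib import Path

# name -> True if the output depends on an optional tool (pandoc, LibreOffice).
EXPECTED_OUTPUTS = {
    "content.md": False,
    "content_pandoc.md": True,
    "images": False,  # assumption: created even when the document has no images
    "pages": True,
    "metadata.json": False,
}


def check_outputs(output_dir: str) -> dict[str, str]:
    """Report each expected output as present, skipped (optional), or missing."""
    root = Path(output_dir)
    report = {}
    for name, optional in EXPECTED_OUTPUTS.items():
        if (root / name).exists():
            report[name] = "present"
        else:
            report[name] = "skipped (optional)" if optional else "missing"
    return report
```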

Reading Workflow

After running the script, follow this order to read the document:

1. Read content.md for text and formulas

view_file <output_dir>/content.md

This contains the full document text with formulas converted to LaTeX. Use this as the primary source for understanding the document's content and mathematical expressions.

2. Compare with content_pandoc.md if available

Pandoc may capture some formatting (headings, lists, tables) more accurately. Cross-reference when the XML-parsed version seems incomplete.
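One way to cross-reference the two files is a line-level diff that lists what only the pandoc version contains. This is a rough sketch using the standard difflib module; the function name `lines_only_in_pandoc` is hypothetical, and the two arguments are the text of content.md and content_pandoc.md.

```python
import difflib


def lines_only_in_pandoc(xml_md: str, pandoc_md: str) -> list[str]:
    """Return non-blank lines found in the pandoc text but not matched in the XML-parsed text."""
    a = [ln.strip() for ln in xml_md.splitlines() if ln.strip()]
    b = [ln.strip() for ln in pandoc_md.splitlines() if ln.strip()]
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    extra = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" both mean the pandoc side has lines without an exact match.
        if tag in ("insert", "replace"):
            extra.extend(b[j1:j2])
    return extra
```

Matching is exact per line, so a heading that pandoc marks with `#` is reported even when the XML-parsed file has the same words. The result is a list of places to inspect, not a list of confirmed losses.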

3. View embedded images

view_file <output_dir>/images/image1.png

Check each image to understand figures, charts, and diagrams.

4. View page renders for layout verification

view_file <output_dir>/pages/page-001.jpg

Use page images to verify overall layout, formula rendering, and figure placement as they appear in the final document.
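Page renders of this kind can be produced with LibreOffice in headless mode followed by pdftoppm. The sketch below shows one plausible pair of commands; it is not necessarily how scripts/read_docx.py does it, and the function names, the `page` prefix, and the 150 dpi default are assumptions.

```python
import shutil
import subprocess
from pathlib import Path


def build_render_commands(docx_path: str, out_dir: str, dpi: int = 150) -> list[list[str]]:
    """Build the two commands: .docx -> PDF (LibreOffice), then PDF -> JPEG pages (pdftoppm)."""
    pages = Path(out_dir) / "pages"
    pdf = pages / (Path(docx_path).stem + ".pdf")
    return [
        ["soffice", "--headless", "--convert-to", "pdf", "--outdir", str(pages), docx_path],
        ["pdftoppm", "-jpeg", "-r", str(dpi), str(pdf), str(pages / "page")],
    ]


def render_pages(docx_path: str, out_dir: str, dpi: int = 150) -> bool:
    """Run both commands; return False without doing anything if either tool is not on PATH."""
    if shutil.which("soffice") is None or shutil.which("pdftoppm") is None:
        return False
    (Path(out_dir) / "pages").mkdir(parents=True, exist_ok=True)
    for cmd in build_render_commands(docx_path, out_dir, dpi):
        subprocess.run(cmd, check=True)
    return True
```

pdftoppm chooses the width of the page number from the total page count, so file names produced this way may differ from the page-001.jpg pattern shown above.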

Dependencies

| Tool | Required | Purpose | Install |
|------|----------|---------|---------|
| Python 3 | ✅ Yes | Script execution | Pre-installed |
| pandoc | ✅ Yes | Text + math extraction | winget install JohnMacFarlane.Pandoc |
| python-docx | Optional | Advanced text extraction | pip install python-docx |
| Pillow | Optional | Image processing | pip install Pillow |
| LibreOffice | Optional | PDF conversion for page renders | Manual install |
| pdftoppm | Optional | PDF → page images | Comes with Poppler |
| PyMuPDF | Optional | PDF → page images (alt) | pip install PyMuPDF |
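A quick check of which of these tools are available can be sketched as follows. The executable name `soffice` for LibreOffice is an assumption (it may not be on PATH, especially on Windows), and `check_dependencies` is a hypothetical helper. The import names `docx`, `PIL`, and `fitz` are the module names under which python-docx, Pillow, and PyMuPDF are installed.

```python
import importlib.util
import shutil

# Command-line tools looked up on PATH.
CLI_TOOLS = ["pandoc", "soffice", "pdftoppm"]
# pip package name -> importable module name.
PY_MODULES = {"python-docx": "docx", "Pillow": "PIL", "PyMuPDF": "fitz"}


def check_dependencies() -> dict[str, bool]:
    """Return True/False for each tool or package, without importing anything."""
    status = {tool: shutil.which(tool) is not None for tool in CLI_TOOLS}
    for package, module in PY_MODULES.items():
        status[package] = importlib.util.find_spec(module) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_dependencies().items():
        print(f"{name}: {'found' if ok else 'not found'}")
```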

Minimum setup: Python + pandoc handles ~90% of cases.
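With only Python and pandoc installed, the pandoc step amounts to a single conversion from docx to markdown, in which pandoc writes equations as TeX math. The wrapper below is a sketch under that assumption; the function names are hypothetical, and the script may pass additional options.

```python
import shutil
import subprocess


def pandoc_command(docx_path: str, out_md: str) -> list[str]:
    """Build a pandoc call that converts a .docx file to markdown."""
    return ["pandoc", docx_path, "--from", "docx", "--to", "markdown", "--output", out_md]


def run_pandoc(docx_path: str, out_md: str) -> bool:
    """Run the conversion; return False if pandoc is not on PATH."""
    if shutil.which("pandoc") is None:
        return False
    subprocess.run(pandoc_command(docx_path, out_md), check=True)
    return True
```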

Formula Handling Details

The XML parser converts Office Math Markup Language (OMML) to LaTeX:

| OMML Element | LaTeX Output |
|--------------|--------------|
| <m:f> (fraction) | \frac{num}{den} |
| <m:rad> (radical) | \sqrt{x} or \sqrt[n]{x} |
| <m:sSup> (superscript) | x^{n} |
| <m:sSub> (subscript) | x_{i} |
| <m:nary> (n-ary) | \sum, \int, \prod |
| <m:d> (delimiter) | \left( ... \right) |

Limitation: MathType equations stored as OLE objects (WMF/EMF images) cannot be parsed to LaTeX. They will appear in the images/ folder instead.
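To make the mapping concrete, here is a small recursive converter covering the six elements in the table. It is an illustrative sketch, not the parser shipped with the skill: the function names are hypothetical, only the listed elements are handled, and the defaults used when a property is absent (∫ for the n-ary operator, parentheses and a `|` separator for delimiters) are assumptions based on the OOXML math schema.

```python
import xml.etree.ElementTree as ET

M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"
NS = {"m": M_NS}
NARY = {"∑": r"\sum", "∫": r"\int", "∏": r"\prod"}


def local_name(el):
    """Strip the namespace from an element tag."""
    return el.tag.split("}")[-1]


def part(el, name):
    """Convert the content of the first child m:<name>, or return '' if it is absent."""
    node = el.find(f"m:{name}", NS)
    return "" if node is None else "".join(omml_to_latex(c) for c in node)


def attr_val(el, path, default):
    """Read the m:val attribute of the element at path, falling back to default."""
    node = el.find(path, NS)
    return default if node is None else node.get(f"{{{M_NS}}}val", default)


def omml_to_latex(el):
    name = local_name(el)
    if name.endswith("Pr"):  # property blocks (fPr, radPr, ...) carry no content
        return ""
    if name == "r":  # a run: concatenate its text nodes
        return "".join(t.text or "" for t in el.findall("m:t", NS))
    if name == "f":
        num, den = part(el, "num"), part(el, "den")
        return rf"\frac{{{num}}}{{{den}}}"
    if name == "rad":
        deg, body = part(el, "deg"), part(el, "e")
        return rf"\sqrt[{deg}]{{{body}}}" if deg else rf"\sqrt{{{body}}}"
    if name == "sSup":
        return f"{part(el, 'e')}^{{{part(el, 'sup')}}}"
    if name == "sSub":
        return f"{part(el, 'e')}_{{{part(el, 'sub')}}}"
    if name == "nary":
        char = attr_val(el, "m:naryPr/m:chr", "∫")
        out = NARY.get(char, char)
        sub, sup = part(el, "sub"), part(el, "sup")
        if sub:
            out += f"_{{{sub}}}"
        if sup:
            out += f"^{{{sup}}}"
        return f"{out} {part(el, 'e')}"
    if name == "d":
        left = attr_val(el, "m:dPr/m:begChr", "(")
        right = attr_val(el, "m:dPr/m:endChr", ")")
        args = ["".join(omml_to_latex(c) for c in e) for e in el.findall("m:e", NS)]
        return rf"\left{left} {'|'.join(args)} \right{right}"
    # Containers such as oMath, e, num: convert the children in order.
    return "".join(omml_to_latex(c) for c in el)
```

Calling `omml_to_latex(ET.fromstring(xml))` on an `<m:oMath>` element returns a LaTeX string. Matrices, accents, and other elements not listed in the table fall through to the container case, so their text survives but their structure is lost, and brace delimiters are not escaped.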

Troubleshooting

  • Garbled formulas: Some complex nested OMML may not convert perfectly. Cross-check with content_pandoc.md or page renders.
  • Missing images: EMF/WMF formats may not render in all viewers. Convert to PNG if needed.
  • No page renders: Install LibreOffice + Poppler for full page-image support.
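
For the missing-images case, a first step is to list the EMF/WMF files in images/ and prepare a conversion command for each. The sketch below assumes LibreOffice's headless `--convert-to png` handles these formats on your installation, which is worth verifying on one file first; the helper names are hypothetical.

```python
from pathlib import Path

VECTOR_EXTS = {".emf", ".wmf"}


def find_vector_images(images_dir: str) -> list[Path]:
    """Return EMF/WMF files in images_dir, sorted by path (extension match is case-insensitive)."""
    return sorted(p for p in Path(images_dir).iterdir() if p.suffix.lower() in VECTOR_EXTS)


def png_convert_command(image: Path) -> list[str]:
    """Build a LibreOffice command that writes a PNG next to the source image (assumed to work for EMF/WMF)."""
    return ["soffice", "--headless", "--convert-to", "png",
            "--outdir", str(image.parent), str(image)]
```

Each command can be run with `subprocess.run(cmd, check=True)` once LibreOffice is installed.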

Install via CLI

npx skills add https://github.com/keypeo/docx-reader-skill --skill docx-reader