name: document-skills
description: Document manipulation toolkit for DOCX, PDF, PPTX, and XLSX files. Create, edit, extract, and convert documents programmatically.
Document Skills
Overview
Comprehensive toolkit for creating, editing, and manipulating documents across multiple formats including Word (DOCX), PDF, PowerPoint (PPTX), and Excel (XLSX). Use this agent for professional document processing, text extraction, tracked changes, and content manipulation.
When to Use This Agent
Use this agent when:
- Creating or editing Word documents (.docx)
- Extracting text or tables from PDFs
- Merging, splitting, or manipulating PDF files
- Creating or modifying PowerPoint presentations
- Reading or writing Excel spreadsheets
- Converting between document formats
- Implementing tracked changes in documents
- Extracting data from document files
DOCX - Word Documents
Overview
A .docx file is a ZIP archive containing XML files and resources. Create, edit, or analyze Word documents using text extraction, raw XML access, or redlining workflows.
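Because a .docx file is an ordinary ZIP archive, Python's standard zipfile module can inspect it without any Word-specific library. The following is a minimal sketch; the helper names list_docx_parts and read_docx_part are illustrative assumptions, not part of this toolkit.

```python
import zipfile


def list_docx_parts(path):
    """Return the names of all parts (files) stored in a .docx archive."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


def read_docx_part(path, part="word/document.xml"):
    """Return one part of the archive decoded as UTF-8 text."""
    with zipfile.ZipFile(path) as archive:
        return archive.read(part).decode("utf-8")
```

For example, list_docx_parts("path-to-file.docx") would typically include word/document.xml, and read_docx_part("path-to-file.docx") returns that part's raw XML for inspection.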
Reading and Analyzing Content
Text Extraction
# Convert document to markdown with tracked changes
pandoc --track-changes=all path-to-file.docx -o output.md
# Options: --track-changes=accept/reject/all
Raw XML Access
# Unpack a file
python ooxml/scripts/unpack.py <office_file> <output_directory>
Key file structures:
- word/document.xml: Main document contents
- word/comments.xml: Comments referenced in document.xml
- word/media/: Embedded images and media files
- Tracked changes use <w:ins> (insertions) and <w:del> (deletions) tags
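To illustrate how the tracked-change tags appear in word/document.xml, here is a small sketch that counts <w:ins> and <w:del> elements with the standard library. The function name count_tracked_changes and the sample XML are assumptions made up for illustration; real documents carry many more attributes and elements.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w" prefix in document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_tracked_changes(document_xml):
    """Count <w:ins> and <w:del> elements in the XML of word/document.xml."""
    root = ET.fromstring(document_xml)
    insertions = sum(1 for _ in root.iter(f"{{{W_NS}}}ins"))
    deletions = sum(1 for _ in root.iter(f"{{{W_NS}}}del"))
    return {"insertions": insertions, "deletions": deletions}


# Hypothetical fragment: "30" replaced by "45" as a tracked change
sample = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
    '<w:r><w:t xml:space="preserve">The term is </w:t></w:r>'
    '<w:del w:id="1" w:author="Reviewer"><w:r><w:delText>30</w:delText></w:r></w:del>'
    '<w:ins w:id="2" w:author="Reviewer"><w:r><w:t>45</w:t></w:r></w:ins>'
    '<w:r><w:t xml:space="preserve"> days.</w:t></w:r>'
    '</w:p></w:body></w:document>'
)
print(count_tracked_changes(sample))  # {'insertions': 1, 'deletions': 1}
```

For files from untrusted sources, the fromstring function in defusedxml.ElementTree (defusedxml is listed under Dependencies) can be swapped in for ET.fromstring.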
Creating New Word Documents
Use docx-js for creating documents from scratch:
- Read docx-js.md for detailed syntax and examples
- Create a JavaScript/TypeScript file using the Document, Paragraph, and TextRun components
- Export as .docx using Packer.toBuffer()
Editing Existing Documents
Use the Document library (Python) for editing:
- Read ooxml.md for the Document library API
- Unpack: python ooxml/scripts/unpack.py <office_file> <output_directory>
- Create a Python script using the Document library
- Pack: python ooxml/scripts/pack.py <input_directory> <office_file>
Redlining Workflow for Document Review
CRITICAL: Implement every requested change as a tracked change, working through the list systematically so that none is skipped.
Batching Strategy: Group related changes into batches of 3-10 changes.
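One possible way to form such batches is sketched below: group the changes by a key such as section, then cap each batch at 10 items. The function batch_changes and the shape of the change records are hypothetical, and the sketch enforces only the upper bound, so small groups may end up with fewer than 3 changes.

```python
from itertools import groupby


def batch_changes(changes, key, max_size=10):
    """Group changes by key(change), then split each group into chunks of at most max_size."""
    batches = []
    ordered = sorted(changes, key=key)  # groupby only merges adjacent items
    for _, group in groupby(ordered, key=key):
        items = list(group)
        for start in range(0, len(items), max_size):
            batches.append(items[start:start + max_size])
    return batches


# Hypothetical change list: 12 edits in section "2", 2 edits in section "5"
changes = [{"section": "2", "id": i} for i in range(12)]
changes += [{"section": "5", "id": i} for i in range(2)]
batches = batch_changes(changes, key=lambda change: change["section"])
print([len(batch) for batch in batches])  # [10, 2, 2]
```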
Principle: Minimal, Precise Edits
- Only mark text that actually changes
- Break replacements into: [unchanged text] + [deletion] + [insertion] + [unchanged text]
- Preserve the original run's RSID for unchanged text
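The split into unchanged text, deletion, insertion, and unchanged text can be computed from the old and new strings. The sketch below uses a simple common-prefix and common-suffix comparison at the character level; split_replacement is an illustrative name, and a word-level diff may give more readable redlines in practice.

```python
def split_replacement(old, new):
    """Split old -> new into (unchanged prefix, deleted, inserted, unchanged suffix)."""
    # Length of the longest common prefix
    prefix_len = 0
    limit = min(len(old), len(new))
    while prefix_len < limit and old[prefix_len] == new[prefix_len]:
        prefix_len += 1
    # Length of the longest common suffix that does not overlap the prefix
    suffix_len = 0
    limit -= prefix_len
    while suffix_len < limit and old[-1 - suffix_len] == new[-1 - suffix_len]:
        suffix_len += 1
    prefix = old[:prefix_len]
    suffix = old[len(old) - suffix_len:]
    deleted = old[prefix_len:len(old) - suffix_len]
    inserted = new[prefix_len:len(new) - suffix_len]
    return prefix, deleted, inserted, suffix


print(split_replacement("The term is 30 days.", "The term is 45 days."))
# ('The term is ', '30', '45', ' days.')
```

The prefix and suffix would stay in ordinary runs (keeping the original run's RSID), while only the deleted and inserted pieces are wrapped in <w:del> and <w:ins>.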
Workflow:
- Convert to markdown: pandoc --track-changes=all path-to-file.docx -o current.md
- Identify and group changes (by section, type, or proximity)
- Read ooxml.md and unpack the document
- Implement changes in batches
- Pack: python ooxml/scripts/pack.py unpacked reviewed-document.docx
- Verify: pandoc --track-changes=all reviewed-document.docx -o verification.md
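The verification step can be partly automated by counting tracked changes in verification.md and comparing the totals with the number of changes planned. This sketch assumes that pandoc's Markdown output renders tracked changes as bracketed spans with an insertion or deletion class; check a real output file before relying on the pattern. The function name and sample text are hypothetical.

```python
import re

# Assumption: tracked changes appear as spans such as
# [text]{.insertion author="..." date="..."} in pandoc's Markdown output.
SPAN_CLASS = re.compile(r"\{\.(insertion|deletion)\b")


def summarize_tracked_changes(markdown_text):
    """Count insertion and deletion spans in pandoc Markdown output."""
    counts = {"insertion": 0, "deletion": 0}
    for match in SPAN_CLASS.finditer(markdown_text):
        counts[match.group(1)] += 1
    return counts


# Hypothetical excerpt of verification.md
sample_md = (
    'The term is [30]{.deletion author="Reviewer" date="2024-01-01T00:00:00Z"}'
    '[45]{.insertion author="Reviewer" date="2024-01-01T00:00:00Z"} days.'
)
print(summarize_tracked_changes(sample_md))  # {'insertion': 1, 'deletion': 1}
```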
Converting DOCX to Images
# Convert DOCX to PDF
soffice --headless --convert-to pdf document.docx
# Convert PDF pages to JPEG
pdftoppm -jpeg -r 150 document.pdf page
PDF - Document Processing
Quick Start
from pypdf import PdfReader, PdfWriter
# Read a PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")
# Extract text
text = ""
for page in reader.pages:
    text += page.extract_text()
Common Operations
Merge PDFs
from pypdf import PdfWriter, PdfReader
writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)
with open("merged.pdf", "wb") as output:
    writer.write(output)
Split PDF
from pypdf import PdfReader, PdfWriter
reader = PdfReader("input.pdf")
# Write each page to its own single-page PDF
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
Extract Text with Layout
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)
Extract Tables
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        # Each table is a list of rows; each row is a list of cell values
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)
Create PDFs
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter
c.drawString(100, height - 100, "Hello World!")
c.save()
Command-Line Tools
# Extract text
pdftotext input.pdf output.txt
# Merge with qpdf
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf
# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
# Extract images
pdfimages -j input.pdf output_prefix
PPTX - PowerPoint Presentations
Overview
.pptx files are ZIP archives containing XML files for slides, layouts, themes, and media.
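As with .docx, the archive can be read with the standard library alone. The sketch below collects the text runs from each slide part; the function name extract_slide_text is an illustrative assumption. Note that the number in a part name such as slide1.xml is not guaranteed to match the slide's position in the deck, which is defined in ppt/presentation.xml.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace; slide text is stored in <a:t> elements
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
SLIDE_PART = re.compile(r"ppt/slides/slide(\d+)\.xml")


def extract_slide_text(path):
    """Map slide part number -> list of text runs found in that slide's XML."""
    texts = {}
    with zipfile.ZipFile(path) as archive:
        for name in archive.namelist():
            match = SLIDE_PART.fullmatch(name)
            if match:
                root = ET.fromstring(archive.read(name))
                texts[int(match.group(1))] = [
                    node.text or "" for node in root.iter(f"{{{A_NS}}}t")
                ]
    # Return entries ordered by part number
    return dict(sorted(texts.items()))
```

Calling extract_slide_text("presentation.pptx") returns a dictionary keyed by part number, which is handy for a quick check before deciding whether raw XML editing is needed.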
Text Extraction
# Convert to markdown
pandoc presentation.pptx -o output.md
Creating Presentations
Use pptxgenjs (JavaScript):
# Install
npm install pptxgenjs
# Create presentation
node create_presentation.js
Example:
const PptxGenJS = require("pptxgenjs");
const pptx = new PptxGenJS();
const slide = pptx.addSlide();
slide.addText("Hello World", { x: 1, y: 1, fontSize: 18 });
slide.addShape(pptx.ShapeType.rect, { x: 1, y: 2, w: 5, h: 3 });
pptx.writeFile({ fileName: "presentation.pptx" });
Editing Presentations
Use python-pptx:
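from pptx import Presentation
from pptx.util import Inches
# Load presentation
prs = Presentation('existing.pptx')
# Add slide (index 6 is the blank layout in the default template; indices vary by template)
blank_slide_layout = prs.slide_layouts[6]
slide = prs.slides.add_slide(blank_slide_layout)
# Add text: a blank layout has no title placeholder, so slide.shapes.title is None;
# add a text box instead
title_box = slide.shapes.add_textbox(Inches(1), Inches(0.5), Inches(8), Inches(1))
title_box.text_frame.text = "New Slide Title"
prs.save('modified.pptx')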
Raw XML Editing
For complex edits, unpack and edit XML directly:
# Unpack
python ooxml/scripts/unpack.py presentation.pptx unpacked/
# Edit ppt/slides/slide1.xml, ppt/presentation.xml, etc.
# Pack
python ooxml/scripts/pack.py unpacked/ presentation.pptx
XLSX - Excel Spreadsheets
Reading Excel Files
import pandas as pd
# Read entire sheet
df = pd.read_excel('file.xlsx')
# Read specific sheet
df = pd.read_excel('file.xlsx', sheet_name='Sheet1')
# Read specific columns (a string selects Excel column letters; a list selects header names)
df = pd.read_excel('file.xlsx', usecols='A:C')
Writing Excel Files
import pandas as pd
# Create DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['NYC', 'LA', 'Chicago']
})
# Write to Excel
df.to_excel('output.xlsx', index=False)
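# Multiple sheets (df1 and df2 stand for any two DataFrames; here both are derived from df)
df1, df2 = df, df[['Name', 'City']]
with pd.ExcelWriter('output.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet1')
    df2.to_excel(writer, sheet_name='Sheet2')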
Advanced Excel Operations
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill
# Load workbook
wb = load_workbook('file.xlsx')
ws = wb.active
# Modify cells
ws['A1'] = 'New Value'
ws['A1'].font = Font(bold=True)
ws['A1'].fill = PatternFill(start_color='FFFF00', end_color='FFFF00', fill_type='solid')
# Add formula
ws['B10'] = '=SUM(B1:B9)'
# Save
wb.save('modified.xlsx')
Quick Reference
| Format | Task | Best Tool |
|---|---|---|
| DOCX | Create new | docx-js (JavaScript) |
| DOCX | Edit existing | Document library (Python) |
| DOCX | Extract text | pandoc |
| DOCX | Tracked changes | Redlining workflow |
| PDF | Extract text | pdfplumber |
| PDF | Extract tables | pdfplumber |
| PDF | Merge/split | pypdf or qpdf |
| PDF | Create | reportlab |
| PPTX | Create new | pptxgenjs |
| PPTX | Edit | python-pptx |
| PPTX | Extract | pandoc |
| XLSX | Read/Write | pandas |
| XLSX | Advanced edits | openpyxl |
Dependencies
# DOCX
npm install -g docx
pip install defusedxml
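# PDF
pip install pypdf pdfplumber reportlab
# System tools: pandoc (DOCX/PPTX text extraction), poppler-utils (pdftotext, pdftoppm, pdfimages), qpdf
# LibreOffice provides the soffice command used for DOCX-to-PDF conversion
apt-get install pandoc poppler-utils qpdf libreoffice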
# PPTX
npm install pptxgenjs
pip install python-pptx
# XLSX
pip install pandas openpyxl