document-skills - SKILL.md Agent Skill

name: document-skills
description: Document manipulation toolkit for DOCX, PDF, PPTX, and XLSX files. Create, edit, extract, and convert documents programmatically.

Document Skills

Overview

Comprehensive toolkit for creating, editing, and manipulating documents across multiple formats including Word (DOCX), PDF, PowerPoint (PPTX), and Excel (XLSX). Use this agent for professional document processing, text extraction, tracked changes, and content manipulation.

When to Use This Agent

Use this agent when:

Creating or editing Word documents (.docx)
Extracting text or tables from PDFs
Merging, splitting, or manipulating PDF files
Creating or modifying PowerPoint presentations
Reading or writing Excel spreadsheets
Converting between document formats
Implementing tracked changes in documents
Extracting data from document files

DOCX - Word Documents

Overview

A .docx file is a ZIP archive containing XML files and resources. Create, edit, or analyze Word documents using text extraction, raw XML access, or redlining workflows.
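Because a .docx is an ordinary ZIP archive, its parts can be listed with Python's standard zipfile module before reaching for any Word-specific tooling. The sketch below is illustrative only: list_docx_parts and make_demo_archive are hypothetical helper names, and the demo archive merely mimics a .docx layout rather than being a valid Word file.

```python
import io
import zipfile


def list_docx_parts(source):
    """Return the sorted member names of a .docx (path or file-like object)."""
    with zipfile.ZipFile(source) as archive:
        return sorted(archive.namelist())


def make_demo_archive():
    """Build a tiny in-memory ZIP that mimics a .docx layout (hypothetical demo data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<document/>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name in list_docx_parts(make_demo_archive()):
        print(name)
```

With a real file, passing its path (for example list_docx_parts("path-to-file.docx")) should work the same way.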

Reading and Analyzing Content

Text Extraction

# Convert document to markdown with tracked changes
pandoc --track-changes=all path-to-file.docx -o output.md
# Options: --track-changes=accept/reject/all
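When the conversion is driven from a script, it can help to validate the --track-changes mode before calling pandoc. The following is a minimal sketch under that assumption; build_pandoc_command and convert_docx_to_markdown are hypothetical names, and the second function requires pandoc to be installed.

```python
import subprocess

# The three modes named in the options comment above.
VALID_MODES = ("accept", "reject", "all")


def build_pandoc_command(docx_path, output_path, track_changes="all"):
    """Return the pandoc argument list for a DOCX-to-markdown conversion."""
    if track_changes not in VALID_MODES:
        raise ValueError(f"track_changes must be one of {VALID_MODES}")
    return ["pandoc", f"--track-changes={track_changes}", docx_path, "-o", output_path]


def convert_docx_to_markdown(docx_path, output_path, track_changes="all"):
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_pandoc_command(docx_path, output_path, track_changes), check=True)
```

Keeping the command construction separate from the subprocess call makes the argument list easy to inspect or log without running pandoc.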

Raw XML Access

# Unpack a file
python ooxml/scripts/unpack.py <office_file> <output_directory>

Key file structures:

word/document.xml - Main document contents
word/comments.xml - Comments referenced in document.xml
word/media/ - Embedded images and media files
Tracked changes use <w:ins> (insertions) and <w:del> (deletions) tags
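As a rough illustration of how those tags appear in word/document.xml, the sketch below counts w:ins and w:del elements in an XML string. The function name count_tracked_changes and the SAMPLE fragment are made up for this example; the sample is not a complete Word document. For untrusted files, a hardened parser such as defusedxml (listed under Dependencies) may be the safer choice than xml.etree.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w:" prefix in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_tracked_changes(document_xml):
    """Count every <w:ins> and <w:del> element in a document.xml string."""
    root = ET.fromstring(document_xml)
    insertions = sum(1 for _ in root.iter(f"{{{W_NS}}}ins"))
    deletions = sum(1 for _ in root.iter(f"{{{W_NS}}}del"))
    return {"insertions": insertions, "deletions": deletions}


# Hypothetical fragment: "30" is deleted and "60" is inserted.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
    '<w:r><w:t>The term is </w:t></w:r>'
    '<w:del w:id="1" w:author="A"><w:r><w:delText>30</w:delText></w:r></w:del>'
    '<w:ins w:id="2" w:author="A"><w:r><w:t>60</w:t></w:r></w:ins>'
    '<w:r><w:t> days.</w:t></w:r>'
    '</w:p></w:body></w:document>'
)

if __name__ == "__main__":
    print(count_tracked_changes(SAMPLE))
```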

Creating New Word Documents

Use docx-js for creating documents from scratch:

Read docx-js.md for detailed syntax and examples
Create JavaScript/TypeScript file using Document, Paragraph, TextRun components
Export as .docx using Packer.toBuffer()

Editing Existing Documents

Use the Document library (Python) for editing:

Read ooxml.md for the Document library API
Unpack: python ooxml/scripts/unpack.py <office_file> <output_directory>
Create Python script using the Document library
Pack: python ooxml/scripts/pack.py <input_directory> <office_file>

Redlining Workflow for Document Review

CRITICAL: Implement every requested change as a tracked change, working through them systematically so that none is skipped.

Batching Strategy: Group related changes into batches of 3-10 changes.

Principle: Minimal, Precise Edits

Only mark text that actually changes
Break replacements into: [unchanged text] + [deletion] + [insertion] + [unchanged text]
Preserve the original run's RSID for unchanged text
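The split into unchanged, deleted, and inserted text can be computed mechanically. Below is a minimal sketch that compares two strings word by word and returns the four pieces; split_replacement is a hypothetical helper, it works on plain text only, and turning the pieces into <w:del> and <w:ins> runs (with the original RSID kept on the unchanged runs) is left to the editing script.

```python
import re


def split_replacement(old, new):
    """Return (prefix, deleted, inserted, suffix) for replacing old with new.

    Text is compared token by token, where a token is a run of
    non-whitespace or a run of whitespace, so spacing is preserved.
    """
    old_tokens = re.findall(r"\S+|\s+", old)
    new_tokens = re.findall(r"\S+|\s+", new)
    limit = min(len(old_tokens), len(new_tokens))

    # Longest common prefix.
    prefix_len = 0
    while prefix_len < limit and old_tokens[prefix_len] == new_tokens[prefix_len]:
        prefix_len += 1

    # Longest common suffix that does not overlap the prefix.
    suffix_len = 0
    while (suffix_len < limit - prefix_len
           and old_tokens[-1 - suffix_len] == new_tokens[-1 - suffix_len]):
        suffix_len += 1

    old_end = len(old_tokens) - suffix_len
    new_end = len(new_tokens) - suffix_len
    prefix = "".join(old_tokens[:prefix_len])
    deleted = "".join(old_tokens[prefix_len:old_end])
    inserted = "".join(new_tokens[prefix_len:new_end])
    suffix = "".join(old_tokens[old_end:])
    return prefix, deleted, inserted, suffix


if __name__ == "__main__":
    print(split_replacement("The term is 30 days.", "The term is 60 days."))
    # → ('The term is ', '30', '60', ' days.')
```

Only "30" and "60" would be marked in this case; the surrounding text stays in unchanged runs.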

Workflow:

Convert to markdown: pandoc --track-changes=all path-to-file.docx -o current.md
Identify and group changes (by section, type, or proximity)
Read ooxml.md and unpack document
Implement changes in batches
Pack: python ooxml/scripts/pack.py unpacked reviewed-document.docx
Verify: pandoc --track-changes=all reviewed-document.docx -o verification.md

Converting DOCX to Images

# Convert DOCX to PDF
soffice --headless --convert-to pdf document.docx

# Convert PDF pages to JPEG
pdftoppm -jpeg -r 150 document.pdf page

PDF - Document Processing

Quick Start

from pypdf import PdfReader, PdfWriter

# Read a PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

# Extract text, separating pages with a newline so words do not run together
text = "\n".join(page.extract_text() for page in reader.pages)

Common Operations

Merge PDFs

from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)

Split PDF

from pypdf import PdfReader, PdfWriter

# Write each page to its own single-page PDF
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)

Extract Text with Layout

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)

Extract Tables

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)
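Each extracted table is a list of rows, and empty cells can come back as None. If the first row is a header, the rows can be turned into dictionaries for easier downstream use. This is a sketch under that assumption; table_to_records is a hypothetical helper and the sample data is invented.

```python
def table_to_records(table):
    """Convert a table (list of rows, first row is the header) into a list of dicts.

    None cells are replaced with empty strings and surrounding whitespace is stripped.
    """
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


if __name__ == "__main__":
    # Invented sample shaped like one entry of page.extract_tables().
    sample = [["Name", "Qty"], ["Widget", "2"], ["Gadget", None]]
    for record in table_to_records(sample):
        print(record)
```

Tables without a header row, or with merged header cells, would need different handling.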

Create PDFs

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter

c.drawString(100, height - 100, "Hello World!")
c.save()

Command-Line Tools

# Extract text
pdftotext input.pdf output.txt

# Merge with qpdf
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf

# Extract images
pdfimages -j input.pdf output_prefix

PPTX - PowerPoint Presentations

Overview

.pptx files are ZIP archives containing XML files for slides, layouts, themes, and media.
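To see that structure in practice, the slide parts can be picked out of the archive by name. The sketch below assumes the usual ppt/slides/slideN.xml naming (the same path pattern used in the raw XML editing step); list_slide_parts and the demo archive are hypothetical.

```python
import io
import re
import zipfile

# Matches slide parts only, not layouts or relationship files.
SLIDE_PART = re.compile(r"ppt/slides/slide(\d+)\.xml")


def list_slide_parts(source):
    """Return slide part names from a .pptx, sorted by their numeric suffix."""
    with zipfile.ZipFile(source) as archive:
        matches = [SLIDE_PART.fullmatch(name) for name in archive.namelist()]
    numbered = sorted((int(m.group(1)), m.group(0)) for m in matches if m)
    return [name for _, name in numbered]


def make_demo_pptx():
    """Build an in-memory ZIP that mimics a .pptx layout (not a valid presentation)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in (
            "ppt/presentation.xml",
            "ppt/slides/slide10.xml",
            "ppt/slides/slide2.xml",
            "ppt/slides/slide1.xml",
            "ppt/slides/_rels/slide1.xml.rels",
            "ppt/slideLayouts/slideLayout1.xml",
        ):
            archive.writestr(name, "<x/>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(list_slide_parts(make_demo_pptx()))
```

Note that the file numbers do not necessarily match the on-screen slide order, which is defined in ppt/presentation.xml.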

Text Extraction

# Convert to markdown
pandoc presentation.pptx -o output.md

Creating Presentations

Use pptxgenjs (JavaScript):

# Install
npm install pptxgenjs

# Create presentation
node create_presentation.js

Example:

const PptxGenJS = require("pptxgenjs");
const pptx = new PptxGenJS();

const slide = pptx.addSlide();
slide.addText("Hello World", { x: 1, y: 1, fontSize: 18 });
slide.addShape(pptx.ShapeType.rect, { x: 1, y: 2, w: 5, h: 3 });

pptx.writeFile({ fileName: "presentation.pptx" });

Editing Presentations

Use python-pptx:

from pptx import Presentation

# Load presentation
prs = Presentation('existing.pptx')

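# Add slide. Layout indices depend on the template; in the default template,
# index 5 is "Title Only" and index 6 is "Blank" (which has no title placeholder).
title_only_layout = prs.slide_layouts[5]
slide = prs.slides.add_slide(title_only_layout)

# Add text; slide.shapes.title is None when the layout has no title placeholder
title = slide.shapes.title
if title is not None:
    title.text = "New Slide Title"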

prs.save('modified.pptx')

Raw XML Editing

For complex edits, unpack and edit XML directly:

# Unpack
python ooxml/scripts/unpack.py presentation.pptx unpacked/

# Edit ppt/slides/slide1.xml, ppt/presentation.xml, etc.

# Pack
python ooxml/scripts/pack.py unpacked/ presentation.pptx

XLSX - Excel Spreadsheets

Reading Excel Files

import pandas as pd

# Read entire sheet
df = pd.read_excel('file.xlsx')

# Read specific sheet
df = pd.read_excel('file.xlsx', sheet_name='Sheet1')

# Read specific columns by Excel letter range
# (a list such as ['Name', 'Age'] selects by header name instead)
df = pd.read_excel('file.xlsx', usecols='A:C')

Writing Excel Files

import pandas as pd

# Create DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['NYC', 'LA', 'Chicago']
})

# Write to Excel
df.to_excel('output.xlsx', index=False)

# Multiple sheets (this overwrites output.xlsx with a two-sheet workbook)
with pd.ExcelWriter('output.xlsx') as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False)
    df[['Name', 'City']].to_excel(writer, sheet_name='Sheet2', index=False)

Advanced Excel Operations

from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill

# Load workbook
wb = load_workbook('file.xlsx')
ws = wb.active

# Modify cells
ws['A1'] = 'New Value'
ws['A1'].font = Font(bold=True)
ws['A1'].fill = PatternFill(start_color='FFFF00', end_color='FFFF00', fill_type='solid')

# Add formula
ws['B10'] = '=SUM(B1:B9)'

# Save
wb.save('modified.xlsx')

Quick Reference

Format	Task	Best Tool
DOCX	Create new	docx-js (JavaScript)
DOCX	Edit existing	Document library (Python)
DOCX	Extract text	pandoc
DOCX	Tracked changes	Redlining workflow
PDF	Extract text	pdfplumber
PDF	Extract tables	pdfplumber
PDF	Merge/split	pypdf or qpdf
PDF	Create	reportlab
PPTX	Create new	pptxgenjs
PPTX	Edit	python-pptx
PPTX	Extract	pandoc
XLSX	Read/Write	pandas
XLSX	Advanced edits	openpyxl

Dependencies

# DOCX
npm install -g docx
pip install defusedxml

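# PDF
pip install pypdf pdfplumber reportlab

# System tools: pandoc (DOCX/PPTX text extraction), poppler-utils (pdftotext, pdftoppm, pdfimages), qpdf, libreoffice (soffice)
apt-get install pandoc poppler-utils qpdf libreoffice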

# PPTX
npm install pptxgenjs
pip install python-pptx

# XLSX
pip install pandas openpyxl