name: pdf-extract
description: Fast, zero-AI text extraction from PDFs that have a text layer (digitally created PDFs from Word, Typst, WeasyPrint, wkhtmltopdf, LaTeX, etc.). Uses pymupdf (fitz) - instant and deterministic. Use when you need to quickly pull raw text from a known text-layer PDF, e.g. "extract text from this PDF", "read this PDF", "get the content of", "what does this PDF say", "quickly read this PDF". Do NOT use for scanned/image PDFs or when you need structured output (tables, headings, OCR, AI analysis) - use the pdf-processing-pro skill in this plugin for those cases.
allowed-tools: Bash, Read, Write
PDF Text Extraction
Extract text from PDF files using pymupdf via uv run --with pymupdf.
Prerequisites
- `uv` installed - see https://docs.astral.sh/uv/getting-started/installation/
- No venv or pre-installation needed - `uv run --with` handles caching automatically
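For repeated use, the same no-venv workflow can be kept in a script file instead of a `-c` string. The sketch below assumes a hypothetical file name `extract_text.py` and relies on uv's support for inline script metadata (the `# /// script` comment block), so that `uv run extract_text.py /path/to/file.pdf` resolves pymupdf on its own. The helper names `join_pages` and `extract` are illustrative, not part of pymupdf.

```python
# /// script
# requires-python = ">=3.9"
# dependencies = ["pymupdf"]
# ///
"""Print the text layer of a PDF (hypothetical file name: extract_text.py)."""
import sys


def join_pages(page_texts):
    """Join non-empty page texts with one blank line between pages."""
    stripped = (text.strip() for text in page_texts)
    return "\n\n".join(text for text in stripped if text)


def extract(pdf_path):
    """Open the PDF and return the text of all pages that contain text."""
    import fitz  # provided by the pymupdf dependency declared above

    with fitz.open(pdf_path) as doc:
        return join_pages(page.get_text() for page in doc)


# Only run the extraction when a PDF path was passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    print(extract(sys.argv[1]))
```

Keeping the page-joining logic in its own function makes it possible to check that part without a PDF at hand.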
Extract text from a single PDF
uv run --with pymupdf python3 -c "
import fitz
doc = fitz.open('/path/to/file.pdf')
for page in doc:
    text = page.get_text().strip()
    if text:
        print(text)
        print()
"
Extract and save to file
uv run --with pymupdf python3 -c "
import fitz
doc = fitz.open('/path/to/file.pdf')
pages = []
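for page in doc:
    text = page.get_text().strip()
    if text:
        pages.append(text)
# Write UTF-8 explicitly so non-ASCII text is saved the same way on every platform
with open('/path/to/output.txt', 'w', encoding='utf-8') as f:
    f.write('\n\n'.join(pages))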
print(f'Extracted {len(pages)} pages')
"
Extract specific pages
uv run --with pymupdf python3 -c "
import fitz
doc = fitz.open('/path/to/file.pdf')
# Pages are 0-indexed
for i in range(2, 5):  # Pages 3-5
    text = doc[i].get_text().strip()
    if text:
        print(text)
"
Batch extract from multiple PDFs
uv run --with pymupdf python3 -c "
import fitz
import glob
import os
for pdf_path in glob.glob('/path/to/folder/*.pdf'):
    doc = fitz.open(pdf_path)
    # Extract each page once and keep only the pages that contain text
    pages = [t for t in (p.get_text().strip() for p in doc) if t]
    out_path = pdf_path.rsplit('.', 1)[0] + '.txt'
    with open(out_path, 'w', encoding='utf-8') as f:
        f.write('\n\n'.join(pages))
    print(f'{os.path.basename(pdf_path)}: {len(pages)} of {len(doc)} pages had text')
"
Get PDF metadata
uv run --with pymupdf python3 -c "
import fitz
doc = fitz.open('/path/to/file.pdf')
meta = doc.metadata
# Metadata fields can be present but empty, so fall back with 'or'
print(f'Title: {meta.get(\"title\") or \"N/A\"}')
print(f'Author: {meta.get(\"author\") or \"N/A\"}')
print(f'Pages: {len(doc)}')
print(f'Creator: {meta.get(\"creator\") or \"N/A\"}')
"
Key notes
- pymupdf is imported as `fitz` (legacy naming from the MuPDF library)
- Pages are 0-indexed: `doc[0]` is the first page
- `get_text()` returns plain text; use `get_text("blocks")` for positioned blocks
- `get_text("html")` returns HTML with formatting preserved
- The package caches after the first `uv run --with pymupdf` invocation - subsequent runs are instant
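As an illustration of the positioned blocks mentioned above: each item returned by `get_text("blocks")` is a tuple of the form `(x0, y0, x1, y1, text, block_no, block_type)`. The sketch below uses a hypothetical helper, `text_blocks_in_reading_order`, that keeps only text blocks and sorts them top to bottom, then left to right. This simple ordering is an assumption that suits single-column pages and will likely interleave multi-column layouts.

```python
def text_blocks_in_reading_order(blocks):
    """Return the text of text blocks, sorted by vertical then horizontal position."""
    # block_type is the 7th field: 0 means text, 1 means image.
    text_only = [b for b in blocks if b[6] == 0]
    # Sort by y0 (rounded to absorb tiny float differences), then by x0.
    text_only.sort(key=lambda b: (round(b[1]), b[0]))
    return [b[4].strip() for b in text_only]


# Usage with pymupdf (not executed here):
#   import fitz
#   page = fitz.open('/path/to/file.pdf')[0]
#   for chunk in text_blocks_in_reading_order(page.get_text("blocks")):
#       print(chunk)
```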
When to use this vs pdf-processing-pro
| Use pdf-extract | Use pdf-processing-pro |
|---|---|
| PDF created digitally (Word, Typst, LaTeX, wkhtmltopdf, WeasyPrint) | Scanned or image-based PDF (photo, fax, scan) |
| Need raw text quickly - less than a second | Need structured output: tables, headings, forms |
| Bulk/batch extraction without AI cost | OCR required (scanned documents) |
| Offline, no API key, no extra dependencies | Form filling, validation, batch workflows |
| Simple text content, no tables needed | Tables or structured layout are important |
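When it is unclear which column of the table a file belongs to, a rough check is to see how many pages yield any text at all. The following is only a heuristic sketch: the function name `looks_like_text_pdf` and both thresholds (`min_chars`, `min_fraction`) are assumptions chosen for illustration, not values taken from pymupdf or from this skill.

```python
def looks_like_text_pdf(page_texts, min_chars=20, min_fraction=0.5):
    """Guess whether a PDF has a usable text layer from its per-page text."""
    pages = list(page_texts)
    if not pages:
        return False
    # Count pages whose extracted text is long enough to be real content.
    with_text = sum(1 for text in pages if len(text.strip()) >= min_chars)
    return with_text / len(pages) >= min_fraction


# Usage with pymupdf (not executed here):
#   import fitz
#   doc = fitz.open('/path/to/file.pdf')
#   if not looks_like_text_pdf(page.get_text() for page in doc):
#       print('Probably scanned - consider pdf-processing-pro')
```

A scanned PDF typically returns empty strings for most pages, so a low fraction suggests that OCR is needed.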