pdf-extract


Fast, zero-AI text extraction from PDFs that have a text layer (digitally created PDFs from Word, Typst, WeasyPrint, wkhtmltopdf, LaTeX, etc.). Uses pymupdf (fitz), which is deterministic and typically finishes in under a second. Use it when you need to pull raw text quickly from a PDF known to have a text layer, e.g. "extract text from this PDF", "read this PDF", "get the content of", "what does this PDF say", "quickly read this PDF". Do NOT use it for scanned/image PDFs or when you need structured output or further processing (tables, headings, OCR, AI analysis); use the pdf-processing-pro skill in this plugin for those cases.

By henkisdabro. Updated 5/15/2026

name: pdf-extract
allowed-tools: Bash, Read, Write

PDF Text Extraction

Extract text from PDF files using pymupdf via uv run --with pymupdf.

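Prerequisites

uv must be installed and available on the PATH. pymupdf needs no separate install step: uv run --with pymupdf fetches it on first use.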

Extract text from a single PDF

uv run --with pymupdf python3 -c "
import fitz
doc = fitz.open('/path/to/file.pdf')
for page in doc:
    text = page.get_text().strip()
    if text:
        print(text)
        print()
"

Extract and save to file

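uv run --with pymupdf python3 -c "
import fitz

doc = fitz.open('/path/to/file.pdf')
pages = []
for page in doc:
    text = page.get_text().strip()
    if text:  # skip pages with no extractable text
        pages.append(text)

# Write UTF-8 explicitly so non-ASCII text survives on any platform
with open('/path/to/output.txt', 'w', encoding='utf-8') as f:
    f.write('\n\n'.join(pages))

print(f'Extracted {len(pages)} of {len(doc)} pages with text')
"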

Extract specific pages

uv run --with pymupdf python3 -c "
import fitz

doc = fitz.open('/path/to/file.pdf')
# Pages are 0-indexed, so indices 2-4 are pages 3-5
# min() keeps the loop inside a document shorter than 5 pages
for i in range(2, min(5, len(doc))):
    text = doc[i].get_text().strip()
    if text:
        print(text)
"

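Because pages are 0-indexed while people usually quote 1-based page numbers, a small translation helper can prevent off-by-one mistakes. The following is a minimal sketch under stated assumptions: page_indices is a hypothetical helper name (not part of pymupdf), and it only understands specs of the form '3-5' or '7'.

```python
def page_indices(spec: str, page_count: int) -> list[int]:
    """Turn a 1-based page spec like '3-5' or '7' into 0-based indices.

    Hypothetical helper for illustration; not part of pymupdf.
    Pages beyond page_count are dropped rather than raising.
    """
    first, _, last = spec.partition('-')
    start = int(first)
    end = int(last) if last else start
    if start < 1 or end < start:
        raise ValueError(f'invalid page range: {spec!r}')
    # Convert to 0-based and clamp to the document length
    return list(range(start - 1, min(end, page_count)))


print(page_indices('3-5', page_count=10))  # → [2, 3, 4]
print(page_indices('4-9', page_count=6))   # → [3, 4, 5]
```

The returned indices can be used directly with doc[i], for example by iterating over page_indices('3-5', len(doc)).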
Batch extract from multiple PDFs

uv run --with pymupdf python3 -c "
import fitz
import glob
import os

for pdf_path in glob.glob('/path/to/folder/*.pdf'):
    doc = fitz.open(pdf_path)
    # Extract each page once, then drop pages with no text
    texts = [p.get_text().strip() for p in doc]
    texts = [t for t in texts if t]
    out_path = os.path.splitext(pdf_path)[0] + '.txt'
    with open(out_path, 'w', encoding='utf-8') as f:
        f.write('\n\n'.join(texts))
    print(f'{os.path.basename(pdf_path)}: {len(texts)} of {len(doc)} pages had text')
    doc.close()
"

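One caveat with a '*.pdf' glob is that it is case-sensitive on case-sensitive filesystems (most Linux setups), so a file named REPORT.PDF would be skipped. The sketch below shows one possible way to collect input and output paths with pathlib instead. plan_outputs is a hypothetical helper name, and it assumes each .txt file should sit next to its PDF, as in the batch command above.

```python
from pathlib import Path


def plan_outputs(folder: str) -> list[tuple[Path, Path]]:
    """Pair every PDF in folder with the .txt path its text would go to.

    Hypothetical helper for illustration. The extension is matched
    case-insensitively, which a '*.pdf' glob does not do on
    case-sensitive filesystems.
    """
    pdfs = sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == '.pdf'
    )
    # with_suffix swaps only the final extension: report.pdf -> report.txt
    return [(p, p.with_suffix('.txt')) for p in pdfs]
```

Each (pdf_path, out_path) pair can then be fed to fitz.open(pdf_path) and open(out_path, 'w', encoding='utf-8') in the same way as the batch loop.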
Get PDF metadata

uv run --with pymupdf python3 -c "
import fitz
doc = fitz.open('/path/to/file.pdf')
meta = doc.metadata
# Unset fields are usually present but empty or None, so fall back with 'or'
print(f'Title: {meta.get(\"title\") or \"N/A\"}')
print(f'Author: {meta.get(\"author\") or \"N/A\"}')
print(f'Pages: {len(doc)}')
print(f'Creator: {meta.get(\"creator\") or \"N/A\"}')
"

Key notes

  • pymupdf is imported as fitz (a legacy name inherited from MuPDF's Fitz rendering engine); recent PyMuPDF releases also accept import pymupdf
  • Pages are 0-indexed: doc[0] is the first page
  • get_text() returns plain text; use get_text("blocks") for positioned blocks
  • get_text("html") returns HTML with formatting preserved
  • uv caches the package after the first uv run --with pymupdf invocation, so later runs skip the download and start quickly
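To illustrate the "blocks" mode mentioned in the notes above, here is a minimal sketch of putting positioned blocks into a simple reading order. It assumes the 7-tuple layout that pymupdf documents for get_text("blocks"), namely (x0, y0, x1, y1, text, block_no, block_type) with block_type 0 for text and 1 for images. The function name and the sample tuples are made up for illustration.

```python
def text_blocks_in_reading_order(blocks):
    """Return the text of text blocks sorted top-to-bottom, then left-to-right.

    Assumes each block is the 7-tuple pymupdf documents for
    get_text('blocks'): (x0, y0, x1, y1, text, block_no, block_type),
    where block_type 0 is text and 1 is an image.
    """
    text_only = [b for b in blocks if b[6] == 0]
    # Round y0 so blocks on nearly the same baseline sort by x0
    ordered = sorted(text_only, key=lambda b: (round(b[1]), b[0]))
    return [b[4].strip() for b in ordered]


# Hand-written sample tuples standing in for page.get_text('blocks')
sample = [
    (72.0, 300.0, 500.0, 340.0, 'Second paragraph\n', 1, 0),
    (72.0, 100.0, 500.0, 140.0, 'Title\n', 0, 0),
    (72.0, 200.0, 300.0, 280.0, '<image: ...>', 2, 1),
]
print(text_blocks_in_reading_order(sample))  # → ['Title', 'Second paragraph']
```

With a real document the input would be page.get_text('blocks'). This naive sort interleaves columns on multi-column pages, so it is only a starting point for single-column layouts.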

When to use this vs pdf-processing-pro

| Use pdf-extract | Use pdf-processing-pro |
| --- | --- |
| PDF created digitally (Word, Typst, LaTeX, wkhtmltopdf, WeasyPrint) | Scanned or image-based PDF (photo, fax, scan) |
| Need raw text quickly - less than a second | Need structured output: tables, headings, forms |
| Bulk/batch extraction without AI cost | OCR required (scanned documents) |
| Offline, no API key, no extra dependencies | Form filling, validation, batch workflows |
| Simple text content, no tables needed | Tables or structured layout are important |
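When it is unclear which side of this comparison a PDF falls on, a rough check on the extracted text can help. The sketch below is a heuristic only, not something pymupdf provides: looks_text_based is a hypothetical name, and both thresholds are arbitrary illustrative defaults.

```python
def looks_text_based(page_texts, min_chars_per_page=20, min_fraction=0.5):
    """Guess whether a PDF has a usable text layer.

    page_texts holds one extracted string per page, e.g.
    [page.get_text() for page in doc]. The thresholds are arbitrary
    illustrative defaults, not values taken from pymupdf.
    """
    if not page_texts:
        return False
    with_text = sum(
        1 for t in page_texts if len(t.strip()) >= min_chars_per_page
    )
    return with_text / len(page_texts) >= min_fraction


print(looks_text_based([
    'Chapter 1. Introduction to the topic.',
    'More body text here, enough of it.',
]))                                         # → True
print(looks_text_based(['', '  \n', '']))   # → False, typical of a scan
```

If the check returns False, the document is probably scanned or image-based and is better handled by pdf-processing-pro.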

Install via CLI

npx skills add https://github.com/henkisdabro/wookstar-claude-plugins --skill pdf-extract