pdf-to-text - SKILL.md Agent Skill

name: pdf-to-text
description: "Convert PDF files to clean text. Handles both embedded-text PDFs and scanned/image PDFs via OCR. Use when the user wants to import, extract, or convert a PDF to text."
allowed-tools: Bash, Read, Write
argument-hint: [pdf-path] [output-path]

Check if PDF has embedded text
```
pdftotext <input.pdf> - | head -20
```
- If output contains readable text → use pdftotext (fast, accurate)
- If output is empty or garbled → fall back to OCR
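The empty-output half of this decision can be approximated mechanically. A minimal sketch, assuming a hypothetical helper `looks_like_text` and an arbitrary default threshold of 50 ASCII letters or digits (neither comes from this skill). It does not detect garbled output, and it undercounts non-Latin scripts, so reading the sample is still required:

```shell
# Hypothetical helper: succeeds when stdin contains at least MIN ASCII letters or digits.
# The default threshold of 50 is an assumption chosen for illustration.
looks_like_text() {
  min="${1:-50}"
  # Strip everything except ASCII letters and digits, then count what is left.
  count=$(LC_ALL=C tr -cd 'A-Za-z0-9' | wc -c | tr -d '[:space:]')
  [ "$count" -ge "$min" ]
}

# Usage, assuming pdftotext is installed:
#   pdftotext -l 2 input.pdf - | looks_like_text && echo "embedded text" || echo "try OCR"
```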
Test-scan the first two pages and evaluate quality
```
pdftotext -f 1 -l 2 <input.pdf> -
```
Read the output and check for:
- Garbled characters, mojibake, or encoding issues
- Missing words, merged lines, or broken paragraphs
- Headers/footers bleeding into body text
- Table or column data mangled into single lines
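One of these checks has a cheap mechanical signal. A rough sketch, assuming a hypothetical helper `count_replacement_lines` that counts lines containing U+FFFD (the Unicode replacement character). A count of zero does not prove the text is clean; it only rules out one kind of encoding damage:

```shell
# Hypothetical helper: prints how many lines on stdin contain U+FFFD,
# the Unicode replacement character that can mark encoding damage.
count_replacement_lines() {
  repl=$(printf '\357\277\275')   # UTF-8 bytes of U+FFFD
  # grep -c exits non-zero when the count is 0, so mask the status.
  LC_ALL=C grep -c "$repl" || true
}

# Usage, assuming pdftotext is installed:
#   pdftotext -f 1 -l 2 input.pdf - | count_replacement_lines
```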
If quality is poor with pdftotext, try OCR on the same two pages:
```
pdftoppm -f 1 -l 2 -png <input.pdf> /tmp/test-scan
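# pdftoppm zero-pads page numbers in longer documents, so glob instead of hard-coding -1.png
for img in /tmp/test-scan-*.png; do tesseract "$img" - 2>/dev/null; done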
rm /tmp/test-scan-*.png
```
Compare both outputs. Report findings to user before proceeding with full extraction.
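A word and line count for each output gives a quick first signal before reading them side by side. A small sketch, assuming a hypothetical helper `summarize_extraction` and hypothetical temp file names; a large gap between the two counts suggests one method is dropping text:

```shell
# Hypothetical helper: reads extracted text on stdin and prints "<label>: N words, M lines".
summarize_extraction() {
  awk -v label="$1" '{ words += NF } END { printf "%s: %d words, %d lines\n", label, words, NR }'
}

# Usage, assuming the two test outputs were saved to these (hypothetical) files:
#   summarize_extraction pdftotext < /tmp/test-pdftotext.txt
#   summarize_extraction ocr       < /tmp/test-ocr.txt
```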

If neither produces clean output, visually inspect the pages:
```
pdftoppm -f 1 -l 2 -png -r 200 <input.pdf> /tmp/visual-check
```
Use the Read tool to view the generated images, typically /tmp/visual-check-1.png and /tmp/visual-check-2.png (Claude vision will render them). pdftoppm zero-pads the page number for longer documents, so list /tmp/visual-check-*.png if those exact names do not exist. Compare what you see on the page to what the text extraction produced. Identify the specific issue (columns, watermarks, unusual fonts, embedded images of text, etc.) and recommend an extraction strategy before proceeding.
```
rm /tmp/visual-check-*.png
```

Extract text

Embedded text (preferred):

```
pdftotext <input.pdf> <output.txt>
```

Scanned/image PDF (OCR fallback):

```
ocrmypdf --force-ocr <input.pdf> /tmp/ocr-temp.pdf
pdftotext /tmp/ocr-temp.pdf <output.txt>
rm /tmp/ocr-temp.pdf
```

If ocrmypdf is not installed:

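```
# Convert pages to images, then OCR each page in order (tesseract takes one image per call)
pdftoppm -png <input.pdf> /tmp/pdf-page
for img in /tmp/pdf-page-*.png; do tesseract "$img" - 2>/dev/null; done > <output.txt>
rm /tmp/pdf-page-*.png
```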

Report results
```
wc -l <output.txt>
pdfinfo <input.pdf> | grep Pages
```
- Report page count and line count to user
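The two numbers can be gathered and combined into a single line for the user. A minimal sketch, assuming hypothetical helpers `parse_page_count` and `report_summary`; the wording of the summary is illustrative only:

```shell
# Hypothetical helper: reads `pdfinfo` output on stdin and prints the page count.
parse_page_count() {
  awk '/^Pages:/ { print $2 }'
}

# Hypothetical helper: prints a one-line summary for the user.
report_summary() {
  pages="$1"; lines="$2"
  printf 'Extracted %s lines from %s pages\n' "$lines" "$pages"
}

# Usage, assuming pdfinfo is installed:
#   pages=$(pdfinfo input.pdf | parse_page_count)
#   lines=$(wc -l < output.txt | tr -d '[:space:]')
#   report_summary "$pages" "$lines"
```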

$ARGUMENTS[0] — path to input PDF (required)
$ARGUMENTS[1] — path to output txt file (optional, defaults to same name with .txt extension)
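
The default output path can be derived from the input path with shell parameter expansion. A short sketch, assuming a hypothetical helper `default_output_path`; appending `.txt` when the input has no `.pdf` extension is an assumption, since the skill only describes the `.pdf` case:

```shell
# Hypothetical helper: prints the default output path for a given input path.
default_output_path() {
  case "$1" in
    # Replace a trailing .pdf (either case) with .txt.
    *.pdf|*.PDF) printf '%s.txt\n' "${1%.*}" ;;
    # Assumption: with no .pdf extension, just append .txt.
    *)           printf '%s.txt\n' "$1" ;;
  esac
}

# Usage:
#   output="${2:-$(default_output_path "$1")}"
```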

Requires at least one of:

Check availability before starting:

```
which pdftotext ocrmypdf tesseract pdftoppm 2>/dev/null
```

If any tools are missing, tell the user what to install:
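
A minimal sketch of that check, assuming hypothetical helpers `missing_tools` and `install_hint`. The package names are assumptions for Debian/Ubuntu (apt) and Homebrew and should be verified for the user's platform:

```shell
# Hypothetical helper: prints each named tool that is not on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Hypothetical helper: maps a tool to the package that usually provides it.
# Package names are assumptions for apt and Homebrew; verify on your platform.
install_hint() {
  case "$1" in
    pdftotext|pdftoppm|pdfinfo) echo "poppler-utils (apt) or poppler (brew)" ;;
    tesseract)                  echo "tesseract-ocr (apt) or tesseract (brew)" ;;
    ocrmypdf)                   echo "ocrmypdf (apt or brew)" ;;
    *)                          echo "unknown package" ;;
  esac
}

# Usage:
#   for t in $(missing_tools pdftotext ocrmypdf tesseract pdftoppm); do
#     echo "$t is missing: install $(install_hint "$t")"
#   done
```

`command -v` is used here instead of `which` because it is specified by POSIX and behaves the same across shells.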