name: ocr-document
description: "Extract text from PDFs, images, and scanned documents. Uses pymupdf (local) or optional cloud OCR APIs."
version: 1.0.0
metadata:
  echo:
    tags: [OCR, PDF, Document, Extract, Text]
OCR & Document Processing
Extract text from PDFs, scanned images, and documents.
PDF Text Extraction (PyMuPDF)
Best choice for text-based PDFs:
pip install pymupdf
import pymupdf
doc = pymupdf.open("file.pdf")
for page in doc:
    text = page.get_text()
    print(text)
# All pages at once
full_text = "\n".join(page.get_text() for page in doc)
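`get_text()` only returns what is in a page's text layer, so a scanned page typically comes back empty or nearly empty. The following is a minimal sketch of how such pages could be flagged for OCR. The helper names and the `min_chars` threshold of 20 are illustrative assumptions, not part of PyMuPDF.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 0-based indexes of pages whose extracted text is shorter than min_chars.

    min_chars is an arbitrary illustrative threshold; tune it for your documents.
    """
    return [i for i, text in enumerate(page_texts) if len(text.strip()) < min_chars]


def extract_page_texts(pdf_path):
    """Read the text layer of every page with PyMuPDF (imported lazily)."""
    import pymupdf

    with pymupdf.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


# Example usage (requires pymupdf and a real file):
#   texts = extract_page_texts("file.pdf")
#   print("Pages with little or no text layer:", pages_needing_ocr(texts))
```

Pages reported by `pages_needing_ocr` are candidates for the image OCR tools described in the next sections.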
PDF → Markdown (marker-pdf)
High-quality conversion preserving structure:
pip install marker-pdf
marker_single file.pdf --output_dir output_dir/ --output_format markdown
Image OCR
Surya OCR (Modern ML-based, best for Chinese)
pip install surya-ocr
surya_ocr image.png --langs zh,en
Pytesseract (Traditional, widely available)
# Install Tesseract engine first
brew install tesseract tesseract-lang # macOS
apt install tesseract-ocr tesseract-ocr-chi-sim # Debian/Ubuntu
pip install pytesseract Pillow
import pytesseract
from PIL import Image
text = pytesseract.image_to_string(
    Image.open("scan.png"),
    lang="chi_sim+eng"
)
Script
python3 scripts/extract_document.py document.pdf
python3 scripts/extract_document.py scan.png
python3 scripts/extract_document.py report.pdf --output extracted.txt
Auto-detects format by extension: PDF → pymupdf, DOCX → python-docx, Image → pytesseract.
The OCR language comes from the system Tesseract configuration (e.g., a chi_sim+eng default), so the matching Tesseract language data must be installed.
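Extension-based detection can be pictured as a lookup from file suffix to backend. The sketch below is a hypothetical illustration, not the contents of `scripts/extract_document.py`. The `BACKENDS` table, the set of image extensions, and `pick_backend` are all assumed names.

```python
from pathlib import Path

# Hypothetical mapping; the real script may recognise a different set of extensions.
BACKENDS = {
    ".pdf": "pymupdf",
    ".docx": "python-docx",
    ".png": "pytesseract",
    ".jpg": "pytesseract",
    ".jpeg": "pytesseract",
    ".tif": "pytesseract",
    ".tiff": "pytesseract",
}


def pick_backend(path):
    """Return the backend name for a file, based only on its (case-insensitive) suffix."""
    suffix = Path(path).suffix.lower()
    try:
        return BACKENDS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None
```

Because the choice depends only on the suffix, a scanned PDF would still be routed to the text-layer backend under this scheme and may need a separate OCR pass.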
Tips
- For scanned PDFs, extract the page images first, then OCR each page
- Preprocessing (deskewing, contrast enhancement) improves OCR accuracy
- Chinese OCR: surya-ocr is generally more accurate than pytesseract
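The first two tips can be combined into one pass. The sketch below is one possible approach: it renders each page to an image with PyMuPDF, boosts contrast with Pillow, and passes the result to pytesseract. The function names, the 300 DPI default, and the choice of grayscale plus autocontrast are assumptions. Deskewing is not covered here.

```python
def zoom_for_dpi(dpi):
    """PDF page units are points (72 per inch), so the render zoom is dpi / 72."""
    return dpi / 72.0


def ocr_scanned_pdf(pdf_path, lang="chi_sim+eng", dpi=300):
    """Render each page to an image, boost contrast, and OCR it.

    Returns a list with one OCR text string per page.
    """
    # Third-party imports are kept local so zoom_for_dpi stays dependency-free.
    import pymupdf
    import pytesseract
    from PIL import Image, ImageOps

    zoom = zoom_for_dpi(dpi)
    page_texts = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            # Rasterise the page; alpha=False gives 3-channel RGB samples.
            pix = page.get_pixmap(matrix=pymupdf.Matrix(zoom, zoom), alpha=False)
            img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            # Simple preprocessing: grayscale, then stretch the contrast range.
            img = ImageOps.autocontrast(ImageOps.grayscale(img))
            page_texts.append(pytesseract.image_to_string(img, lang=lang))
    return page_texts


# Example usage (requires pymupdf, Pillow, pytesseract, and the Tesseract engine):
#   text = "\n".join(ocr_scanned_pdf("scanned.pdf"))
```

Higher DPI values usually help with small print but increase render and OCR time, so 300 is only a starting point.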