ocr - SKILL.md Agent Skill

name: ocr
description: OCR images and PDFs to Markdown using Qwen3.5 VL via MLX on Apple Silicon. Use when the user needs to extract text from images, scanned PDFs, photos of documents, or any visual content with text. Supports PNG, JPG, TIFF, WebP, BMP, and PDF files.
compatibility: macOS Apple Silicon only (requires MLX)

OCR — Visual Document to Markdown

Converts images and PDFs to Markdown using Qwen3.5 Vision-Language models running locally on Apple Silicon via MLX.

Best for: scanned documents, photos, complex layouts, CJK content, credit card bills, invoices, receipts — anything where traditional PDF text extraction fails.

Setup

Run once to install dependencies and pre-download the model (~2.5 GB):

bash {baseDir}/setup.sh

Convert a File

bash {baseDir}/convert.sh <input-file>

Supported formats: .pdf, .png, .jpg, .jpeg, .tiff, .tif, .webp, .bmp
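A caller may want to check the extension before invoking the script. The following is a minimal sketch of such a check; `is_supported` is a hypothetical helper, not part of the skill, and treating extensions case-insensitively is an assumption, since the script's own matching rules are not shown here.

```shell
# Hypothetical pre-flight check: succeed only for the extensions listed above.
# Assumption: uppercase extensions (e.g. .JPG) are acceptable; convert.sh may differ.
is_supported() {
  local ext
  # Reject names with no extension at all.
  case "$1" in
    *.*) ;;
    *) return 1 ;;
  esac
  # Take the text after the last dot and lowercase it.
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    pdf|png|jpg|jpeg|tiff|tif|webp|bmp) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `is_supported scan.pdf && echo ok` prints `ok`, while `is_supported notes.docx` returns a non-zero status.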

Output: <input-file-without-ext>.md next to the original file.
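The default output location can be predicted with plain shell parameter expansion. This sketch assumes only the final extension is stripped; `default_output_path` is a hypothetical name used for illustration, and the script's actual implementation may differ.

```shell
# Hypothetical helper: compute where the Markdown output is expected to land.
# Assumption: only the last extension is removed before appending .md.
default_output_path() {
  local input="$1"
  # ${input%.*} drops the shortest trailing ".something" from the path.
  printf '%s\n' "${input%.*}.md"
}

default_output_path "/path/to/scan.pdf"   # prints /path/to/scan.md
```

This is handy when a later step needs to read the result without passing `--output` explicitly.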

Options

bash {baseDir}/convert.sh <input-file> [options]

Options:
  --think          Enable model reasoning (slower, sometimes better for complex layouts)
  --pages 1,3-5    Page range for PDFs (1-indexed, default: all pages)
  --output <path>  Custom output path (default: <input-file-without-ext>.md next to the input file)
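To make the `--pages` syntax concrete, the sketch below expands a spec such as `1,3-5` into the individual 1-indexed page numbers it selects. `expand_pages` is a hypothetical illustration of the notation only; it is not the parser used by convert.sh and does no input validation.

```shell
# Illustrative only: expand a --pages spec (comma-separated pages and
# inclusive ranges) into a space-separated list of page numbers.
expand_pages() {
  local spec="$1" part start end page out=""
  # Turn commas into spaces so the spec can be iterated word by word.
  for part in $(printf '%s' "$spec" | tr ',' ' '); do
    case "$part" in
      *-*) start="${part%-*}"; end="${part#*-}" ;;  # range like 3-5
      *)   start="$part"; end="$part" ;;            # single page
    esac
    page="$start"
    while [ "$page" -le "$end" ]; do
      out="${out:+$out }$page"
      page=$((page + 1))
    done
  done
  printf '%s\n' "$out"
}

expand_pages "1,3-5"   # prints 1 3 4 5
```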

Examples

# OCR a scanned PDF
bash {baseDir}/convert.sh /path/to/scan.pdf

# OCR a photo of a receipt
bash {baseDir}/convert.sh /path/to/receipt.jpg

# OCR specific pages of a PDF
bash {baseDir}/convert.sh /path/to/document.pdf --pages 1,3-5

Workflow

1. Run the convert script on the image or PDF.
2. Read the resulting .md file to understand the content.
3. For multi-page PDFs, pages are separated by --- in the output.
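When pages need to be handled one at a time, the combined output can be split on the separator. The sketch below assumes a page break is a line consisting solely of `---` (optionally followed by whitespace); `split_pages` and the `<prefix>-NNN.md` naming are hypothetical. Note that a Markdown horizontal rule inside a page would be treated as a break as well.

```shell
# Hypothetical helper: split OCR output into one file per page.
# Usage: split_pages <combined.md> <output-prefix>
# Assumption: a line containing only '---' separates pages.
split_pages() {
  local md="$1" prefix="$2"
  awk -v prefix="$prefix" '
    BEGIN { n = 1 }
    # Separator line: finish the current page file and move to the next page.
    /^---[[:space:]]*$/ { if (out != "") close(out); n++; next }
    # Any other line is appended to the current page file.
    { out = sprintf("%s-%03d.md", prefix, n); print > (out) }
  ' "$md"
}
```

For instance, `split_pages /path/to/document.md /path/to/document-page` would write `/path/to/document-page-001.md`, `/path/to/document-page-002.md`, and so on.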