pdf-vision

PDF RAG pipeline: text extraction + Apple Vision OCR + API embedding + cosine search. Triggers: "pdf 읽기" (read PDF), "pdf rag", "문서 검색" (document search), "pdf 분석" (PDF analysis), "pdf 청크" (PDF chunk), "pdf-expert", "pdf 임베딩" (PDF embedding)

By lidge-jun, updated 6/4/2026

PDF Expert Skill

When to use

  • Large PDF that won't fit in context (100+ pages)
  • Scanned/image PDF that needs OCR
  • Semantic search across one or multiple PDFs
  • Building a personal document knowledge base (NotebookLM-style)

Quick Reference

| Task | Command |
| --- | --- |
| Check text vs scan pages | python3 <skill>/scripts/pdf_check.py "file.pdf" |
| Check with custom threshold | python3 <skill>/scripts/pdf_check.py "file.pdf" --threshold 50 --quality-ratio 0.2 |
| Extract scan pages as PNG | pdftoppm -png -f 3 -l 3 "file.pdf" /tmp/page_3 |
| OCR images (parallel) | swift <skill>/scripts/ocr.swift 3 /tmp/page_*.png |
| Embed → JSON vectors | python3 <skill>/scripts/pdf_embed.py merged.json "doc_name" |
| Re-embed (overwrite) | python3 <skill>/scripts/pdf_embed.py merged.json "doc_name" --force |
| Semantic search (1 doc) | python3 <skill>/scripts/pdf_query.py "question" --file doc_name |
| Semantic search (all docs) | python3 <skill>/scripts/pdf_query.py "question" |
| Keyword search (free) | python3 <skill>/scripts/pdf_query.py "keyword" --file doc_name --grep |
| Column-based OCR (optional) | python3 <skill>/scripts/ocr_columns.py image.png --cols 3 |

Replace <skill> with the absolute path to this skill folder.

Workflow

Step 1: Assess PDF (pdf_check.py)

python3 <skill>/scripts/pdf_check.py "/path/to/doc.pdf"

Output:

{
  "file": "/absolute/path/to/doc.pdf",
  "total_pages": 10,
  "text_pages": [{"page": 1, "text": "..."}, {"page": 3, "text": "..."}],
  "ocr_pages": [2, 4, 5]
}
  • text_pages: pages with sufficient extractable text
  • ocr_pages: page numbers that need OCR
  • Optional flags: --threshold N (min chars, default 100), --quality-ratio F (min meaningful line ratio, default 0.3)
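
How the two flags might interact can be sketched as follows. This is an illustrative guess, not the actual logic of pdf_check.py: the function name `needs_ocr` and the definition of a "meaningful" line (at least three alphanumeric characters) are assumptions made for the example.

```python
def needs_ocr(text: str, threshold: int = 100, quality_ratio: float = 0.3) -> bool:
    """Return True if a page's extracted text looks too sparse to trust.

    Hypothetical heuristic; the real pdf_check.py may define this differently.
    """
    stripped = text.strip()
    # Too few characters overall: treat the page as scanned.
    if len(stripped) < threshold:
        return True
    lines = [ln for ln in stripped.splitlines() if ln.strip()]
    if not lines:
        return True
    # Assumption: a line is "meaningful" if it has >= 3 alphanumeric characters.
    meaningful = sum(1 for ln in lines if sum(c.isalnum() for c in ln) >= 3)
    return meaningful / len(lines) < quality_ratio
```

Under this reading, lowering --threshold or --quality-ratio sends fewer pages to OCR, which is useful for PDFs with short but valid pages.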

Step 2: Extract OCR page images (pdftoppm)

For each page in ocr_pages, extract as PNG:

pdftoppm -png -f 2 -l 2 "/path/to/doc.pdf" /tmp/doc_p2
pdftoppm -png -f 4 -l 4 "/path/to/doc.pdf" /tmp/doc_p4
pdftoppm -png -f 5 -l 5 "/path/to/doc.pdf" /tmp/doc_p5

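Output files: /tmp/doc_p2-1.png, /tmp/doc_p4-1.png, /tmp/doc_p5-1.png. The numeric suffix is appended by pdftoppm and may differ depending on the poppler version or the PDF's page count (for example, it may be zero-padded), so check the actual names or use a glob such as /tmp/doc_p*.png.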
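
The per-page extraction can be scripted. The sketch below is one possible wrapper, not part of the skill: `pdftoppm_command` and `extract_pages` are hypothetical helper names, and the output files are discovered with a glob instead of assuming the exact suffix pdftoppm appends.

```python
import glob
import subprocess


def pdftoppm_command(pdf_path: str, page: int, prefix: str) -> list[str]:
    """Build the argv for rendering a single page to PNG."""
    return ["pdftoppm", "-png", "-f", str(page), "-l", str(page),
            pdf_path, f"{prefix}{page}"]


def extract_pages(pdf_path: str, pages: list[int],
                  prefix: str = "/tmp/doc_p") -> list[str]:
    """Render each page and return the PNG paths that were actually written."""
    written = []
    for page in pages:
        subprocess.run(pdftoppm_command(pdf_path, page, prefix), check=True)
        # Glob for the result rather than guessing the numeric suffix.
        written.extend(sorted(glob.glob(f"{glob.escape(prefix)}{page}-*.png")))
    return written
```

Calling `extract_pages("/path/to/doc.pdf", [2, 4, 5])` would run the three commands shown above and return the resulting PNG paths, ready to pass to ocr.swift.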

Step 3: OCR with Apple Vision (ocr.swift)

swift <skill>/scripts/ocr.swift 3 /tmp/doc_p2-1.png /tmp/doc_p4-1.png /tmp/doc_p5-1.png

Output:

[
  {"file": "/tmp/doc_p2-1.png", "text": "OCR result...", "error": null},
  {"file": "/tmp/doc_p4-1.png", "text": "OCR result...", "error": null}
]
  • First argument is max concurrency (3 recommended for M1 16GB)
  • Results are sorted by filename
  • If error is not null, OCR failed for that page; skip it or retry
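
A small helper can separate usable results from pages that need a retry. This is a sketch based only on the output shape shown above; `split_ocr_results` is a hypothetical name, not a function shipped with the skill.

```python
import json


def split_ocr_results(raw_json: str) -> tuple[list[dict], list[str]]:
    """Separate successful OCR entries from files that should be retried."""
    ok, failed = [], []
    for entry in json.loads(raw_json):
        if entry.get("error") is None:
            ok.append(entry)
        else:
            failed.append(entry["file"])
    return ok, failed
```

The `failed` list holds image paths, so it can be passed back to ocr.swift as the file arguments for a second attempt.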

Step 4: Merge & Embed (pdf_embed.py)

Merge the outputs from Step 1 and Step 3 into a single JSON file:

[
  {"page": 1, "text": "text from pdf_check", "source": "/path/to/doc.pdf"},
  {"page": 2, "text": "text from ocr.swift", "source": "/path/to/doc.pdf"},
  {"page": 3, "text": "text from pdf_check", "source": "/path/to/doc.pdf", "tables": [["h1","h2"],["v1","v2"]]}
]

How to merge:

  • text_pages from Step 1 → use page and text directly
  • ocr.swift results from Step 3 → extract page number from filename (e.g., doc_p2-1.png → page 2)
  • Skip entries where error is not null
  • The tables field is optional. For table-aware search, extract tables with pdfplumber's page.extract_tables() and include them; pdf_embed.py will create a separate chunk for each table.
  • source should be the original PDF path
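
The merge rules above can be sketched in Python. This is a minimal example, not a script shipped with the skill: `merge_pages` and `write_merged` are hypothetical names, and the regular expression assumes the Step 2 prefix convention `<name>_p<page>` (for example doc_p2-1.png).

```python
import json
import re
from pathlib import Path

# Assumes Step 2 prefixes of the form "<name>_p<page>", e.g. doc_p2-1.png.
PAGE_RE = re.compile(r"_p(\d+)-\d+\.png$")


def merge_pages(check: dict, ocr_results: list[dict]) -> list[dict]:
    """Combine pdf_check.py text pages with successful ocr.swift results."""
    source = check["file"]
    merged = [
        {"page": p["page"], "text": p["text"], "source": source}
        for p in check["text_pages"]
    ]
    for entry in ocr_results:
        if entry.get("error") is not None:
            continue  # skip pages whose OCR failed
        match = PAGE_RE.search(entry["file"])
        if match is None:
            continue  # filename does not follow the assumed pattern
        merged.append(
            {"page": int(match.group(1)), "text": entry["text"], "source": source}
        )
    return sorted(merged, key=lambda item: item["page"])


def write_merged(merged: list[dict], out_path: str = "/tmp/merged.json") -> None:
    """Write the merged pages as UTF-8 JSON, keeping non-ASCII text readable."""
    Path(out_path).write_text(
        json.dumps(merged, ensure_ascii=False, indent=2), encoding="utf-8"
    )
```

Sorting by page keeps the merged file in reading order. The optional tables field is left out here and could be added per page in the same loop.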

Save as /tmp/merged.json, then embed:

python3 <skill>/scripts/pdf_embed.py /tmp/merged.json "my_document"

Output: ~/.cli-jaw/vectors/my_document.vectors.json

If the output file already exists, pdf_embed.py skips embedding (a PDF is static, so no incremental update is needed). Use --force to re-embed.

Step 5: Query (pdf_query.py)

Semantic search (requires 1 API call for query embedding):

python3 <skill>/scripts/pdf_query.py "How much is the 5G plan?" --file my_document

Keyword search (no API call, free):

python3 <skill>/scripts/pdf_query.py "NPV" --file my_document --grep

Search all documents (omit --file):

python3 <skill>/scripts/pdf_query.py "contract termination conditions"

Adjust result count with --n:

python3 <skill>/scripts/pdf_query.py "question" --file my_document --n 10
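
Keyword mode needs no embedding, so it amounts to a local text scan. The sketch below shows one way such a scan could work. It assumes each stored chunk is a dict with a "text" field, and the `grep_chunks` name and the default of 5 results are illustrative choices, not documented behaviour of pdf_query.py.

```python
def grep_chunks(chunks: list[dict], keyword: str, n: int = 5) -> list[dict]:
    """Return up to n chunks whose text contains the keyword (case-insensitive)."""
    needle = keyword.casefold()
    hits = [c for c in chunks if needle in c.get("text", "").casefold()]
    return hits[:n]
```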

Storage

~/.cli-jaw/vectors/
├── air_프로모션.vectors.json
├── 계약서_2026.vectors.json
└── 기출문제_수학.vectors.json

Each .vectors.json contains embedded chunks with vectors, text, page numbers, and metadata. Document embedding is done once; after that, ranking runs locally with cosine similarity. A semantic query costs one API call to embed the query, and keyword search (--grep) costs none.
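
Local ranking by cosine similarity can be sketched in a few lines of standard-library Python. The exact schema of a .vectors.json file is not documented here, so the "vector" field name is an assumption, as are the `cosine` and `top_n` helper names.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n(query_vec: list[float], chunks: list[dict],
          n: int = 5) -> list[tuple[float, dict]]:
    """Score every chunk against the query vector and keep the n best."""
    # Assumption: each chunk stores its embedding under the key "vector".
    scored = [(cosine(query_vec, c["vector"]), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]
```

Only `query_vec` requires an API call; the chunk vectors are read from disk, which is why repeated searches over the same document add no embedding cost.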

Environment Variables

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| EMBEDDING_PROVIDER | No | gemini | gemini, openai, or vertex |
| GEMINI_API_KEY | When provider=gemini | | Google AI API key |
| OPENAI_API_KEY | When provider=openai | | OpenAI API key |
| EMBEDDING_MODEL | No | Provider-dependent* | Embedding model name |
| EMBEDDING_BATCH_SIZE | No | 50 | Texts per API call |
| VERTEX_PROJECT | When provider=vertex | ADC auto-detect | GCP project ID |
| VERTEX_LOCATION | When provider=vertex | us-central1 | GCP region |
| EMBEDDING_DIM | When provider=vertex | 0 (full) | Output dimensionality |

*EMBEDDING_MODEL defaults:

  • gemini → gemini-embedding-001
  • openai → text-embedding-3-small
  • vertex → gemini-embedding-001
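
The defaults above can be expressed as a small resolver. This is a sketch of how the documented variables might be read, not the code in _embedding.py; `resolve_embedding_config` is a hypothetical name, and the Vertex-specific variables are left out for brevity.

```python
import os

DEFAULT_MODELS = {
    "gemini": "gemini-embedding-001",
    "openai": "text-embedding-3-small",
    "vertex": "gemini-embedding-001",
}
# API-key variables that must be set for a given provider.
REQUIRED_KEYS = {"gemini": "GEMINI_API_KEY", "openai": "OPENAI_API_KEY"}


def resolve_embedding_config(env=None) -> dict:
    """Apply the documented defaults and check the provider's API key."""
    env = os.environ if env is None else env
    provider = env.get("EMBEDDING_PROVIDER", "gemini")
    if provider not in DEFAULT_MODELS:
        raise ValueError(f"unknown EMBEDDING_PROVIDER: {provider!r}")
    key_name = REQUIRED_KEYS.get(provider)
    if key_name and not env.get(key_name):
        raise ValueError(f"{key_name} must be set when provider={provider}")
    return {
        "provider": provider,
        "model": env.get("EMBEDDING_MODEL") or DEFAULT_MODELS[provider],
        "batch_size": int(env.get("EMBEDDING_BATCH_SIZE", "50")),
    }
```

Passing a plain dict as `env` makes the resolver easy to test without touching the real environment.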

Dependencies

| Package | Required by | Install |
| --- | --- | --- |
| pdfplumber | pdf_check.py | pip install pdfplumber |
| google-auth | _embedding.py (Vertex only) | pip install google-auth |
| Pillow | ocr_columns.py only | pip install Pillow |
| Xcode CLT (Swift) | ocr.swift | Usually already installed on macOS; otherwise xcode-select --install |
| poppler (pdftoppm) | Step 2 image extraction | brew install poppler |

Gemini and OpenAI providers use stdlib urllib only — no extra packages needed.

Optional: Column-based OCR (ocr_columns.py)

For images with column layouts (tables, multi-column documents):

python3 <skill>/scripts/ocr_columns.py image.png --cols 3 --header-px 100

The script splits the image into columns (plus an optional header strip), runs OCR on each part separately, and returns the combined result. It requires Pillow.

| Flag | Default | Description |
| --- | --- | --- |
| --cols N | 5 | Number of columns |
| --header-px N | 0 | Header height in pixels to extract separately |
| --output | json | json or text |
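
The splitting step can be illustrated by computing the crop rectangles. The sketch below assumes equal-width columns, which may not match what ocr_columns.py actually does, and `column_boxes` is a hypothetical helper. The tuples follow Pillow's `Image.crop` box order (left, upper, right, lower).

```python
def column_boxes(width: int, height: int, cols: int = 5,
                 header_px: int = 0) -> list[tuple[int, int, int, int]]:
    """Return crop boxes: the optional header strip first, then each column."""
    boxes = []
    if header_px > 0:
        boxes.append((0, 0, width, header_px))
    for i in range(cols):
        # Integer arithmetic keeps adjacent columns contiguous with no gaps.
        left = i * width // cols
        right = (i + 1) * width // cols
        boxes.append((left, header_px, right, height))
    return boxes
```

For a 900x600 image with `cols=3` and `header_px=100`, this yields one 900x100 header strip and three 300x500 columns, each of which could be cropped and sent to OCR on its own.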

Constraints

  • M1 concurrency: Do not set ocr.swift concurrency above 4 on an M1 with 16 GB of RAM. Recommended: 3.
  • OCR languages: ocr.swift is hardcoded to Korean (ko-KR) and English (en-US). Other languages may produce inaccurate results.
  • Cleanup: Delete temporary PNG files and merged JSON from /tmp/ after embedding.
Install via CLI
npx skills add https://github.com/lidge-jun/cli-jaw-skills --skill pdf-vision