name: pdf-vision
description: >
  PDF RAG pipeline: text extraction + Apple Vision OCR + API embedding + cosine search.
  Triggers: "pdf 읽기", "pdf rag", "문서 검색", "pdf 분석", "pdf 청크", "pdf-expert", "pdf 임베딩"
PDF Expert Skill
When to use
- Large PDF that won't fit in context (100+ pages)
- Scanned/image PDF that needs OCR
- Semantic search across one or multiple PDFs
- Building a personal document knowledge base (NotebookLM-style)
Quick Reference
| Task | Command |
|---|---|
| Check text vs scan pages | python3 <skill>/scripts/pdf_check.py "file.pdf" |
| Check with custom threshold | python3 <skill>/scripts/pdf_check.py "file.pdf" --threshold 50 --quality-ratio 0.2 |
| Extract scan pages as PNG | pdftoppm -png -f 3 -l 3 "file.pdf" /tmp/page_3 |
| OCR images (parallel) | swift <skill>/scripts/ocr.swift 3 /tmp/page_*.png |
| Embed → JSON vectors | python3 <skill>/scripts/pdf_embed.py merged.json "doc_name" |
| Re-embed (overwrite) | python3 <skill>/scripts/pdf_embed.py merged.json "doc_name" --force |
| Semantic search (1 doc) | python3 <skill>/scripts/pdf_query.py "질문" --file doc_name |
| Semantic search (all docs) | python3 <skill>/scripts/pdf_query.py "질문" |
| Keyword search (free) | python3 <skill>/scripts/pdf_query.py "키워드" --file doc_name --grep |
| Column-based OCR (optional) | python3 <skill>/scripts/ocr_columns.py image.png --cols 3 |
Replace <skill> with the absolute path to this skill folder.
Workflow
Step 1: Assess PDF (pdf_check.py)
python3 <skill>/scripts/pdf_check.py "/path/to/doc.pdf"
Output:
{
"file": "/absolute/path/to/doc.pdf",
"total_pages": 10,
"text_pages": [{"page": 1, "text": "..."}, {"page": 3, "text": "..."}],
"ocr_pages": [2, 4, 5]
}
- text_pages: pages with sufficient extractable text
- ocr_pages: page numbers that need OCR
- Optional flags: --threshold N (min chars, default 100), --quality-ratio F (min meaningful line ratio, default 0.3)
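To illustrate how the two flags could interact, here is a minimal sketch of a page classifier. It is not the code in pdf_check.py: the function name `needs_ocr` and the definition of a "meaningful" line (at least 3 alphanumeric characters) are assumptions made only for this example.

```python
def needs_ocr(text: str, threshold: int = 100, quality_ratio: float = 0.3) -> bool:
    """Return True when a page's extracted text looks too thin to trust.

    Hypothetical heuristic for illustration, not the actual pdf_check.py logic.
    """
    stripped = text.strip()
    # Too few characters: probably a scanned page without a text layer.
    if len(stripped) < threshold:
        return True
    lines = [ln for ln in stripped.splitlines() if ln.strip()]
    if not lines:
        return True
    # Assumption: a line is "meaningful" if it has at least 3 alphanumeric characters.
    meaningful = [ln for ln in lines if sum(ch.isalnum() for ch in ln) >= 3]
    return len(meaningful) / len(lines) < quality_ratio


print(needs_ocr("short"))
print(needs_ocr("\n".join(["This is a normal sentence of text."] * 5)))
```

Raising --threshold or --quality-ratio in this model sends more pages to OCR; lowering them trusts the embedded text layer more.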
Step 2: Extract OCR page images (pdftoppm)
For each page in ocr_pages, extract as PNG:
pdftoppm -png -f 2 -l 2 "/path/to/doc.pdf" /tmp/doc_p2
pdftoppm -png -f 4 -l 4 "/path/to/doc.pdf" /tmp/doc_p4
pdftoppm -png -f 5 -l 5 "/path/to/doc.pdf" /tmp/doc_p5
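Output files: /tmp/doc_p2-1.png, /tmp/doc_p4-1.png, /tmp/doc_p5-1.png. The suffix after the prefix is generated by pdftoppm and can differ (for example, a zero-padded page number), so confirm the actual names with ls /tmp/doc_p*.png before Step 3.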
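When ocr_pages is long, the per-page commands can be generated instead of typed. The sketch below builds one pdftoppm argument list per page, using the same flags as the commands above. The helper name `pdftoppm_commands` and the default prefix are assumptions for this example.

```python
import shlex


def pdftoppm_commands(pdf_path, ocr_pages, prefix="/tmp/doc_p"):
    """Build one pdftoppm argument list per page that needs OCR."""
    return [
        ["pdftoppm", "-png", "-f", str(page), "-l", str(page), pdf_path, f"{prefix}{page}"]
        for page in ocr_pages
    ]


# Print the commands for the ocr_pages from the Step 1 example output.
for cmd in pdftoppm_commands("/path/to/doc.pdf", [2, 4, 5]):
    print(shlex.join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) to run the extraction.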
Step 3: OCR with Apple Vision (ocr.swift)
swift <skill>/scripts/ocr.swift 3 /tmp/doc_p2-1.png /tmp/doc_p4-1.png /tmp/doc_p5-1.png
Output:
[
{"file": "/tmp/doc_p2-1.png", "text": "OCR result...", "error": null},
{"file": "/tmp/doc_p4-1.png", "text": "OCR result...", "error": null}
]
- First argument is max concurrency (3 recommended for M1 16GB)
- Results sorted by filename
- If error is not null, that page's OCR failed; skip or retry it
Step 4: Merge & Embed (pdf_embed.py)
Merge the outputs from Step 1 and Step 3 into a single JSON file:
[
{"page": 1, "text": "text from pdf_check", "source": "/path/to/doc.pdf"},
{"page": 2, "text": "text from ocr.swift", "source": "/path/to/doc.pdf"},
{"page": 3, "text": "text from pdf_check", "source": "/path/to/doc.pdf", "tables": [["h1","h2"],["v1","v2"]]}
]
How to merge:
- text_pages from Step 1 → use page and text directly
- ocr.swift results from Step 3 → extract page number from filename (e.g., doc_p2-1.png → page 2)
- Skip entries where error is not null
- tables field is optional. If you want table-aware search, extract tables with pdfplumber's page.extract_tables() and include them. pdf_embed.py will create separate chunks for each table.
- source should be the original PDF path
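The merge rules can be sketched as a small function. This is one possible implementation, not a script shipped with the skill: `merge_pages` is a hypothetical name, and the regular expression assumes the prefix_p<page>-<suffix>.png naming used in Step 2.

```python
import json
import re
from pathlib import Path


def merge_pages(check_result, ocr_results):
    """Combine pdf_check.py text pages with ocr.swift results, sorted by page."""
    source = check_result["file"]
    merged = [
        {"page": p["page"], "text": p["text"], "source": source}
        for p in check_result["text_pages"]
    ]
    for item in ocr_results:
        if item.get("error") is not None:
            continue  # OCR failed for this page; skip it
        # Assumed naming pattern from Step 2, e.g. doc_p2-1.png -> page 2.
        match = re.search(r"_p(\d+)-\d+\.png$", Path(item["file"]).name)
        if match is None:
            continue  # filename does not follow the assumed pattern
        merged.append({"page": int(match.group(1)), "text": item["text"], "source": source})
    return sorted(merged, key=lambda entry: entry["page"])


# Sample inputs shaped like the Step 1 and Step 3 outputs.
check = {
    "file": "/path/to/doc.pdf",
    "text_pages": [{"page": 1, "text": "a"}, {"page": 3, "text": "c"}],
    "ocr_pages": [2, 4],
}
ocr = [
    {"file": "/tmp/doc_p2-1.png", "text": "b", "error": None},
    {"file": "/tmp/doc_p4-1.png", "text": "", "error": "failed"},
]
merged = merge_pages(check, ocr)
print(json.dumps(merged, ensure_ascii=False, indent=2))
```

Writing the result with json.dump(merged, f, ensure_ascii=False) keeps Korean text readable in the merged file.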
Save as /tmp/merged.json, then embed:
python3 <skill>/scripts/pdf_embed.py /tmp/merged.json "my_document"
Output: ~/.cli-jaw/vectors/my_document.vectors.json
If the output file already exists, embedding is skipped (a PDF is static, so no incremental update is needed). Use --force to re-embed.
Step 5: Query (pdf_query.py)
Semantic search (requires 1 API call for query embedding):
python3 <skill>/scripts/pdf_query.py "5G 요금제 가격이 얼마야?" --file my_document
Keyword search (no API call, free):
python3 <skill>/scripts/pdf_query.py "NPV" --file my_document --grep
Search all documents (omit --file):
python3 <skill>/scripts/pdf_query.py "계약 해지 조건"
Adjust result count with --n:
python3 <skill>/scripts/pdf_query.py "질문" --file my_document --n 10
Storage
~/.cli-jaw/vectors/
├── air_프로모션.vectors.json
├── 계약서_2026.vectors.json
└── 기출문제_수학.vectors.json
Each .vectors.json contains embedded chunks with vectors, text, page numbers, and metadata. Document embedding is done once. After that, similarity is computed locally (cosine similarity), so a semantic search costs only the 1 API call needed to embed the query, and keyword search (--grep) costs nothing.
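For readers unfamiliar with local cosine search, the following is a minimal pure-Python sketch of the ranking step. The chunk layout (a list of objects with "vector", "text", and "page" keys), the function names, and the default n=5 are assumptions; the actual schema of .vectors.json and the internals of pdf_query.py may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_chunks(query_vector, chunks, n=5):
    """Return the n (score, chunk) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vector, c["vector"]), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]


# Toy 2-dimensional vectors; real embeddings have hundreds of dimensions or more.
chunks = [
    {"page": 1, "text": "pricing", "vector": [1.0, 0.0]},
    {"page": 2, "text": "terms", "vector": [0.0, 1.0]},
    {"page": 3, "text": "mixed", "vector": [1.0, 1.0]},
]
for score, chunk in top_chunks([1.0, 0.1], chunks, n=2):
    print(f"page {chunk['page']}: {score:.3f}")
```

Only the query vector comes from the embedding API; the scoring loop runs entirely on the stored vectors.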
Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| EMBEDDING_PROVIDER | No | gemini | gemini, openai, or vertex |
| GEMINI_API_KEY | When provider=gemini | (none) | Google AI API key |
| OPENAI_API_KEY | When provider=openai | (none) | OpenAI API key |
| EMBEDDING_MODEL | No | Provider-dependent* | Embedding model name |
| EMBEDDING_BATCH_SIZE | No | 50 | Texts per API call |
| VERTEX_PROJECT | When provider=vertex | ADC auto-detect | GCP project ID |
| VERTEX_LOCATION | When provider=vertex | us-central1 | GCP region |
| EMBEDDING_DIM | When provider=vertex | 0 (full) | Output dimensionality |
*EMBEDDING_MODEL defaults:
- gemini → gemini-embedding-001
- openai → text-embedding-3-small
- vertex → gemini-embedding-001
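The defaults above can be read as a small resolution step. The sketch below shows one way to apply them. `resolve_embedding_config` is a hypothetical helper, not the code in _embedding.py, and it covers only the provider, model, and batch size settings.

```python
import os

# Defaults taken from the tables above.
DEFAULT_MODELS = {
    "gemini": "gemini-embedding-001",
    "openai": "text-embedding-3-small",
    "vertex": "gemini-embedding-001",
}
REQUIRED_KEYS = {"gemini": "GEMINI_API_KEY", "openai": "OPENAI_API_KEY"}


def resolve_embedding_config(env=None):
    """Resolve provider, model, and batch size from environment-style settings."""
    env = os.environ if env is None else env
    provider = env.get("EMBEDDING_PROVIDER", "gemini")
    if provider not in DEFAULT_MODELS:
        raise ValueError(f"unknown EMBEDDING_PROVIDER: {provider!r}")
    key_name = REQUIRED_KEYS.get(provider)
    if key_name and not env.get(key_name):
        raise ValueError(f"{key_name} must be set when provider={provider}")
    return {
        "provider": provider,
        "model": env.get("EMBEDDING_MODEL") or DEFAULT_MODELS[provider],
        "batch_size": int(env.get("EMBEDDING_BATCH_SIZE", "50")),
    }


# Example with an explicit mapping instead of the real environment.
print(resolve_embedding_config({"EMBEDDING_PROVIDER": "openai", "OPENAI_API_KEY": "sk-example"}))
```

Passing a plain dict keeps the example independent of the shell environment; calling it with no argument reads os.environ.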
Dependencies
| Package | Required by | Install |
|---|---|---|
| pdfplumber | pdf_check.py | pip install pdfplumber |
| google-auth | _embedding.py (Vertex only) | pip install google-auth |
| Pillow | ocr_columns.py only | pip install Pillow |
| Xcode CLT (Swift) | ocr.swift | Usually pre-installed on macOS |
| poppler (pdftoppm) | Step 2 image extraction | brew install poppler |
Gemini and OpenAI providers use stdlib urllib only — no extra packages needed.
Optional: Column-based OCR (ocr_columns.py)
For images with column layouts (tables, multi-column documents):
python3 <skill>/scripts/ocr_columns.py image.png --cols 3 --header-px 100
Splits image into columns (and optional header), OCRs each part separately, returns combined result. Requires Pillow.
| Flag | Default | Description |
|---|---|---|
| --cols N | 5 | Number of columns |
| --header-px N | 0 | Header height in pixels to extract separately |
| --output | json | json or text |
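To make the effect of --cols and --header-px concrete, the sketch below computes the crop rectangles for an image of a given size. It assumes equal-width columns and a full-width header strip, which may not match what ocr_columns.py does internally; `column_boxes` is a hypothetical helper.

```python
def column_boxes(width, height, cols=5, header_px=0):
    """Return (left, upper, right, lower) boxes: optional header first, then columns."""
    if cols < 1:
        raise ValueError("cols must be at least 1")
    header_px = max(0, min(header_px, height))
    boxes = []
    if header_px > 0:
        boxes.append((0, 0, width, header_px))  # full-width header strip
    for i in range(cols):
        # Integer arithmetic keeps adjacent columns contiguous with no gaps.
        left = width * i // cols
        right = width * (i + 1) // cols
        boxes.append((left, header_px, right, height))
    return boxes


for box in column_boxes(900, 1000, cols=3, header_px=100):
    print(box)
```

Each tuple uses the (left, upper, right, lower) order that Pillow's Image.crop accepts.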
Constraints
- M1 concurrency: Do not exceed 4 for ocr.swift concurrency on M1 16GB. Recommended: 3.
- OCR languages: ocr.swift is hardcoded to Korean (ko-KR) and English (en-US). Other languages may produce inaccurate results.
- Cleanup: Delete temporary PNG files and merged JSON from /tmp/ after embedding.
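The cleanup step can be scripted. The helper below is a minimal sketch and is not part of the skill's scripts. The name `cleanup_temp_files` is hypothetical, and the patterns you pass should match the prefixes you actually used in Step 2 and Step 4.

```python
import glob
import os


def cleanup_temp_files(patterns):
    """Delete files matching the given glob patterns; return the removed paths."""
    removed = []
    for pattern in patterns:
        for path in glob.glob(pattern):
            if os.path.isfile(path):
                os.remove(path)
                removed.append(path)
    return sorted(removed)
```

For the example in this document, the call would be cleanup_temp_files(["/tmp/doc_p*.png", "/tmp/merged.json"]). Run it only after confirming that the .vectors.json file was written.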