name: edgeparse description: Extract structured content from any PDF for AI agents, RAG pipelines, and Copilot Skills. Use this skill whenever the user wants to read, analyze, or reason about a PDF document; needs to feed document content to an LLM; mentions PDF extraction, parsing, or conversion; wants tables, headings, or bounding boxes from a PDF; is building a RAG pipeline; or asks an agent to process a document. Install with: pip install edgeparse license: Apache-2.0 metadata: authors: "EdgeParse Contributors" version: "0.1.0" package: "edgeparse" install_python: "pip install edgeparse" install_node: "npm install edgeparse" source: "raphaelmansuy/edgeparse"
EdgeParse Skill
Enables AI agents to extract clean, structured content from any PDF — headings, tables, paragraphs, lists, bounding boxes — deterministically, without ML dependencies or GPU requirements.
Install: pip install edgeparse · Node.js: npm install edgeparse
Speed: ~0.023 s/doc (Apple M4 Max, 200-doc benchmark)
When to reach for this skill
Activate when the workflow involves:
- Reading or analyzing a PDF document on behalf of a user
- Building a RAG pipeline that ingests PDFs
- Feeding PDF content to an LLM for summarization, Q&A, or synthesis
- Extracting tables from financial reports, research papers, or invoices
- Processing a batch of documents for indexing or search
- An agent tool that must "open" a PDF and return its contents
Quick start
import edgeparse
# Convert any PDF to Markdown — best for LLM context windows
text = edgeparse.convert("report.pdf", format="markdown")
# Convert to JSON with bounding boxes and full structure
import json
doc = json.loads(edgeparse.convert("report.pdf", format="json"))
# Plain text (fast, minimal)
plain = edgeparse.convert("report.pdf", format="text")
The format parameter controls output:
| Value | Best for |
|---|---|
"markdown" |
LLM context — headings, tables, lists in Markdown |
"json" |
Bounding boxes, citations, structured element metadata |
"html" |
Web rendering, semantic HTML5 |
"text" |
Simple full-text search, minimal output |
Core API
edgeparse.convert()
result: str = edgeparse.convert(
input_path, # str or Path — required
format="markdown", # output format (see table above)
pages=None, # e.g. "1-5" or "1,3,7-10" — specific pages only
password=None, # for password-protected PDFs
reading_order="xycut", # "xycut" (spatial sort, default) or "off"
table_method="default", # "default" (ruling-line) or "cluster" (borderless)
image_output="off", # "off", "embedded" (base64), "external" (files)
)
Returns the extracted content as a string. Raises FileNotFoundError for missing files and ValueError for corrupt PDFs or bad options.
edgeparse.convert_file()
out_path: str = edgeparse.convert_file(
input_path,
output_dir="output", # write output file to this directory
format="markdown",
pages=None,
password=None,
)
Writes the output file and returns its path.
Common patterns
Feed a PDF to an LLM
import edgeparse
import anthropic
doc = edgeparse.convert("report.pdf", format="markdown")
client = anthropic.Anthropic()
response = client.messages.create(
model="claude-opus-4-5",
max_tokens=4096,
messages=[{
"role": "user",
"content": f"Analyze this document and summarize the key findings:\n\n{doc}"
}]
)
print(response.content[0].text)
RAG pipeline — chunk with metadata
import edgeparse, json
raw = edgeparse.convert("paper.pdf", format="json")
doc = json.loads(raw)
chunks = []
for el in doc["elements"]:
if el["type"] in ("paragraph", "heading", "table"):
chunks.append({
"text": el["text"],
"metadata": {
"page": el["page_number"],
"type": el["type"],
"bbox": el["bounding_box"], # for citation highlights
"order": el["reading_order"],
}
})
# Now embed chunks["text"] and store chunks["metadata"] in your vector store
Batch processing
import edgeparse
from pathlib import Path
results = {}
for pdf in Path("documents/").glob("*.pdf"):
try:
results[pdf.name] = edgeparse.convert(str(pdf), format="markdown")
except Exception as e:
results[pdf.name] = f"ERROR: {e}"
Extract specific pages only
# Pages 1–5
text = edgeparse.convert("report.pdf", format="markdown", pages="1-5")
# Non-contiguous pages
text = edgeparse.convert("report.pdf", format="markdown", pages="1,3,7-10")
Borderless table extraction
Many financial reports and invoices use tables without ruling lines.
Use table_method="cluster" to handle them:
text = edgeparse.convert(
"earnings.pdf",
format="markdown",
table_method="cluster" # spatial clustering for borderless tables
)
Password-protected PDF
text = edgeparse.convert("secure.pdf", format="markdown", password="mypassword")
Node.js usage
import { convert } from 'edgeparse';
const markdown = convert('report.pdf', { format: 'markdown' });
const json = convert('report.pdf', { format: 'json' });
// With options
const result = convert('report.pdf', {
format: 'markdown',
pages: '1-5',
readingOrder: 'xycut',
tableMethod: 'cluster',
});
JSON output schema
When format="json", the output is a JSON string with shape:
{
"page_count": 10,
"title": "Document Title",
"elements": [
{
"type": "heading",
"level": 1,
"text": "Introduction",
"page_number": 1,
"reading_order": 0,
"bounding_box": { "x0": 72, "y0": 144, "x1": 540, "y1": 180 }
},
{
"type": "table",
"text": "| Col A | Col B |\n|-------|-------|\n| val1 | val2 |",
"page_number": 2,
"bounding_box": { "x0": 72, "y0": 200, "x1": 540, "y1": 350 }
},
{
"type": "paragraph",
"text": "This is body text...",
"page_number": 1,
"reading_order": 2,
"bounding_box": { "x0": 72, "y0": 190, "x1": 540, "y1": 220 }
}
]
}
Element type values: heading, paragraph, table, list, list_item, figure, caption, header, footer.
Error handling
import edgeparse
try:
text = edgeparse.convert("report.pdf", format="markdown")
except FileNotFoundError:
# PDF file not found — check the path
pass
except ValueError as e:
# Invalid format, corrupt PDF, wrong password, or bad page range
print(f"Extraction failed: {e}")
For more detail
Read these reference files when the SKILL.md body isn't enough:
references/api.md— complete Python + Node.js API with all parameters and typesreferences/patterns.md— LangChain, LlamaIndex, MCP tool, CrewAI, and async batch patterns