name: source-to-markdown
description: Convert PDF or HTML sources (local files or URLs) into Markdown for bibliography and notes capture, including batch ingestion into a local sources/ folder. Use when collecting research material and normalizing downloaded sources into .md files for citation, indexing, or paper drafting workflows.
Source to Markdown
Convert sources into Markdown using a deterministic pipeline with Python 3.12.
Workflow
- Convert one source (output defaults to sources/):
python3.12 skills/source-to-markdown/scripts/source_to_md.py <input>
- Convert one source to an explicit path:
python3.12 skills/source-to-markdown/scripts/source_to_md.py <input> bibliography/sources/my-paper.md
- Batch ingest many sources:
python3.12 skills/source-to-markdown/scripts/ingest_sources.py --list bibliography/urls.txt --sources-dir sources
- If auto-detection is wrong, force type:
python3.12 skills/source-to-markdown/scripts/source_to_md.py <input> --kind pdf
python3.12 skills/source-to-markdown/scripts/source_to_md.py <input> --kind html
- Keep generated sources untracked:
echo "sources/" >> .gitignore
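The auto-detection that --kind overrides is not documented here. A minimal sketch of how such detection might work, assuming it checks PDF magic bytes first and the path suffix second, is shown below. The function name guess_kind and its head parameter are hypothetical and are not taken from source_to_md.py.

```python
from pathlib import Path
from urllib.parse import urlparse


def guess_kind(source: str, head: bytes = b"") -> str:
    """Guess 'pdf' or 'html' for a local path or URL (illustrative only).

    `head` is the first few bytes of the content, if already fetched.
    """
    # A PDF file starts with the "%PDF-" signature.
    if head.lstrip().startswith(b"%PDF-"):
        return "pdf"
    # urlparse works for URLs and leaves plain local paths in .path.
    suffix = Path(urlparse(source).path).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    # Assumption: anything else is treated as HTML.
    return "html"
```

When a guess like this is wrong (for example, a PDF served from a URL with no .pdf suffix), passing --kind pdf sidesteps the heuristic entirely.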
Tool Requirements
- For PDF: pdftotext required.
- For HTML: pandoc preferred; falls back to built-in HTML text extraction if unavailable.
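The requirement and fallback rules above can be sketched as a small lookup, assuming the tools are located on PATH with shutil.which. The function name pick_converter and the label "builtin-html" are hypothetical; the actual scripts may be organised differently.

```python
import shutil
from typing import Callable, Optional


def pick_converter(
    kind: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return the name of the converter to use for `kind` (illustrative only)."""
    if kind == "pdf":
        # pdftotext is required: fail loudly instead of falling back.
        if which("pdftotext") is None:
            raise RuntimeError("pdftotext is required for PDF sources")
        return "pdftotext"
    if kind == "html":
        # pandoc is preferred; otherwise use the built-in text extraction.
        return "pandoc" if which("pandoc") else "builtin-html"
    raise ValueError(f"unknown kind: {kind!r}")
```

The which parameter exists only so the check can be exercised without the tools installed.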
Notes
- Output includes source metadata frontmatter.
- Markdown is normalized for readable archival notes, not exact typographic fidelity.
- INSPIRE (inspirehep.net) and KEK domains are treated as first-class bibliography sources.
- For INSPIRE/KEK record pages, the converter tries to resolve a direct PDF link before conversion.
- arXiv abs URLs are automatically resolved to direct pdf URLs.
- HTTP fetches now use an explicit user-agent header to reduce provider-side blocking.
- Use only lawful/open-access sources. The converter blocks known piracy-host domains.
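Three of the notes above (abs-to-pdf resolution, the explicit user-agent header, and host blocking) can be illustrated together. This is a sketch under stated assumptions, not the converter's actual code: the names resolve_arxiv, is_blocked, and build_request, the user-agent string, and the placeholder blocklist entry are all hypothetical.

```python
import re
from urllib.parse import urlparse
from urllib.request import Request

# Hypothetical values; the real blocklist and user-agent string are not documented here.
BLOCKED_HOSTS = {"piracy-host.example"}
USER_AGENT = "source-to-markdown/0.1 (research notes)"

_ARXIV_ABS = re.compile(
    r"^https?://(?:www\.)?arxiv\.org/abs/(?P<id>[^?#]+)", re.IGNORECASE
)


def resolve_arxiv(url: str) -> str:
    """Map an arXiv abs URL to its direct pdf URL; return other URLs unchanged."""
    match = _ARXIV_ABS.match(url)
    if match is None:
        return url
    return f"https://arxiv.org/pdf/{match.group('id')}.pdf"


def is_blocked(url: str) -> bool:
    """True if the URL's host is a blocked domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == b or host.endswith("." + b) for b in BLOCKED_HOSTS)


def build_request(url: str) -> Request:
    """Build a fetch request with an explicit User-Agent, refusing blocked hosts."""
    if is_blocked(url):
        raise ValueError(f"blocked host: {url}")
    return Request(resolve_arxiv(url), headers={"User-Agent": USER_AGENT})
```

The returned Request can be passed to urllib.request.urlopen; nothing is fetched until then.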
OA Triage Pattern
When a source is missing, use this order:
- arXiv direct PDF (https://arxiv.org/pdf/<id>.pdf).
- INSPIRE record/API and follow legal PDF/URL fields.
- KEK records (if reachable from current network).
- Publisher page (only if open/legal access works).
- If unavailable, create a sources/pending-<key>.md note with attempted URLs and status.
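The last step of the triage, recording a source that could not be obtained, might look like the sketch below. The note layout, the function name write_pending_note, and its parameters are assumptions for illustration; only the sources/pending-<key>.md path pattern comes from the text above.

```python
from datetime import date
from pathlib import Path
from typing import Iterable, Optional, Tuple


def write_pending_note(
    key: str,
    attempts: Iterable[Tuple[str, str]],
    sources_dir: str = "sources",
    today: Optional[date] = None,
) -> Path:
    """Write sources/pending-<key>.md listing each attempted URL and its status."""
    day = (today or date.today()).isoformat()
    lines = [
        f"# Pending source: {key}",
        "",
        f"Checked: {day}",
        "",
        "Attempted URLs:",
    ]
    # One bullet per attempt, in the order the triage tried them.
    lines += [f"- {url} ({status})" for url, status in attempts]
    path = Path(sources_dir) / f"pending-{key}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path
```

Keeping the attempts in triage order makes it easy to see later which lawful routes were already tried before retrying.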
Allowed download domains (agent config): arxiv.org, sci-hub.kvnp.top. All other web access is blocked.