source-to-markdown - SKILL.md Agent Skill

name: source-to-markdown description: Convert PDF or HTML sources (local files or URLs) into Markdown for bibliography and notes capture, including batch ingestion into a local `sources/` folder. Use when collecting research material and normalizing downloaded sources into `.md` files for citation, indexing, or paper drafting workflows.

Source to Markdown

Convert sources into Markdown using a deterministic pipeline with Python 3.12.

Workflow

Convert one source (output defaults to sources/):

python3.12 skills/source-to-markdown/scripts/source_to_md.py <input>

Convert one source to an explicit path:

python3.12 skills/source-to-markdown/scripts/source_to_md.py <input> bibliography/sources/my-paper.md

Batch ingest many sources:

python3.12 skills/source-to-markdown/scripts/ingest_sources.py --list bibliography/urls.txt --sources-dir sources

If auto-detection is wrong, force type:

python3.12 skills/source-to-markdown/scripts/source_to_md.py <input> --kind pdf
python3.12 skills/source-to-markdown/scripts/source_to_md.py <input> --kind html

Keep generated sources untracked:

echo "sources/" >> .gitignore

Tool Requirements

For PDF: pdftotext required.
For HTML: pandoc preferred; falls back to built-in HTML text extraction if unavailable.

Notes

Output includes source metadata frontmatter.
Markdown is normalized for readable archival notes, not exact typographic fidelity.
INSPIRE (inspirehep.net) and KEK domains are treated as first-class bibliography sources.
For INSPIRE/KEK record pages, the converter tries to resolve a direct PDF link before conversion.
ArXiv abs URLs are automatically resolved to direct pdf URLs.
HTTP fetches now use an explicit user-agent header to reduce provider-side blocking.
Use only lawful/open-access sources. The converter blocks known piracy-host domains.

OA Triage Pattern

When a source is missing, use this order:

ArXiv direct PDF (https://arxiv.org/pdf/<id>.pdf).
Sci-Hub mirror (https://sci-hub.kvnp.top/<doi>) — for paywalled papers.
INSPIRE record/API and follow legal PDF/URL fields.
KEK records (if reachable from current network).
Publisher page (only if open/legal access works).
If unavailable, create a sources/pending-<key>.md note with attempted URLs and status.

Allowed download domains (agent config): arxiv.org, sci-hub.kvnp.top. All other web access is blocked.