source-to-markdown


Convert PDF or HTML sources (local files or URLs) into Markdown for bibliography and notes capture, including batch ingestion into a local `sources/` folder. Use when collecting research material and normalizing downloaded sources into `.md` files for citation, indexing, or paper drafting workflows.

By arivero. Updated 2/15/2026.

name: source-to-markdown
description: Convert PDF or HTML sources (local files or URLs) into Markdown for bibliography and notes capture, including batch ingestion into a local sources/ folder. Use when collecting research material and normalizing downloaded sources into .md files for citation, indexing, or paper drafting workflows.

Source to Markdown

Convert sources into Markdown using a deterministic pipeline with Python 3.12.

Workflow

  1. Convert one source (output defaults to sources/):
     python3.12 skills/source-to-markdown/scripts/source_to_md.py <input>
  2. Convert one source to an explicit path:
     python3.12 skills/source-to-markdown/scripts/source_to_md.py <input> bibliography/sources/my-paper.md
  3. Batch ingest many sources:
     python3.12 skills/source-to-markdown/scripts/ingest_sources.py --list bibliography/urls.txt --sources-dir sources
  4. If auto-detection is wrong, force the type:
     python3.12 skills/source-to-markdown/scripts/source_to_md.py <input> --kind pdf
     python3.12 skills/source-to-markdown/scripts/source_to_md.py <input> --kind html
  5. Keep generated sources untracked:
     echo "sources/" >> .gitignore
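
How the script auto-detects the source type is not described here. As a rough illustration of why `--kind` can be needed, the sketch below shows one plausible detection order (magic bytes, then Content-Type, then file extension). The function name `guess_kind` and its parameters are hypothetical and are not taken from source_to_md.py.

```python
from pathlib import Path
from urllib.parse import urlparse

PDF_MAGIC = b"%PDF-"  # every PDF file starts with this marker


def guess_kind(source: str, head: bytes = b"", content_type: str = "") -> str:
    """Guess 'pdf' or 'html' for a local path or URL (hypothetical helper)."""
    # 1. Magic bytes are the strongest signal when the first bytes are known.
    if head.lstrip().startswith(PDF_MAGIC):
        return "pdf"
    # 2. Next, trust a Content-Type declared by the server, if any.
    ctype = content_type.split(";", 1)[0].strip().lower()
    if ctype == "application/pdf":
        return "pdf"
    if ctype in ("text/html", "application/xhtml+xml"):
        return "html"
    # 3. Finally, fall back to the extension of the path (or URL path).
    path = urlparse(source).path if "://" in source else source
    if Path(path).suffix.lower() == ".pdf":
        return "pdf"
    return "html"
```

A URL such as a download link with no extension and no usable Content-Type would fall through to "html" in this sketch, which is the kind of case where forcing `--kind pdf` helps.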

Tool Requirements

  • For PDF: pdftotext (shipped with Poppler's poppler-utils) is required.
  • For HTML: pandoc is preferred; if it is unavailable, the converter falls back to built-in HTML text extraction.
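
The requirements above can be checked before a batch run. The following is a minimal sketch, assuming the tools are looked up on `PATH`; `pick_backends` is a hypothetical helper, not part of the skill's scripts.

```python
import shutil


def pick_backends(which=shutil.which) -> dict:
    """Report which converter backend would be used for each source kind."""
    pdftotext = which("pdftotext")
    pandoc = which("pandoc")
    return {
        # PDF conversion has no fallback: None means it cannot run.
        "pdf": "pdftotext" if pdftotext else None,
        # HTML conversion falls back to built-in text extraction.
        "html": "pandoc" if pandoc else "builtin",
    }


if __name__ == "__main__":
    backends = pick_backends()
    if backends["pdf"] is None:
        print("pdftotext not found: PDF conversion will not work")
    print(f"HTML backend: {backends['html']}")
```

Passing `which` as a parameter keeps the check easy to exercise without the tools installed.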

Notes

  • Output includes source metadata frontmatter.
  • Markdown is normalized for readable archival notes, not exact typographic fidelity.
  • INSPIRE (inspirehep.net) and KEK domains are treated as first-class bibliography sources.
  • For INSPIRE/KEK record pages, the converter tries to resolve a direct PDF link before conversion.
  • arXiv abs URLs (https://arxiv.org/abs/<id>) are automatically resolved to direct PDF URLs.
  • HTTP fetches send an explicit User-Agent header to reduce provider-side blocking.
  • Use only lawful/open-access sources. The converter blocks known piracy-host domains.
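
Two of the notes above (arXiv URL resolution and the explicit User-Agent header) can be sketched with the standard library. This is an illustration under assumptions: the function names, the regular expression, and the `USER_AGENT` string are hypothetical, and the real script may handle more URL shapes.

```python
import re
import urllib.request
from urllib.parse import urlparse

# Hypothetical value; the string the real script sends is not documented here.
USER_AGENT = "source-to-markdown/0.1 (research notes capture)"

# Matches /abs/<id>, including old-style ids such as hep-th/9901001v2.
_ARXIV_ABS = re.compile(r"^/abs/(?P<id>.+?)/?$")


def resolve_arxiv(url: str) -> str:
    """Map an arXiv abs URL to its direct PDF URL; return other URLs unchanged."""
    parts = urlparse(url)
    if parts.netloc.lower() not in ("arxiv.org", "www.arxiv.org"):
        return url
    match = _ARXIV_ABS.match(parts.path)
    if match is None:
        return url
    return f"https://arxiv.org/pdf/{match.group('id')}.pdf"


def build_request(url: str) -> urllib.request.Request:
    """Build a request for the resolved URL with an explicit User-Agent."""
    return urllib.request.Request(
        resolve_arxiv(url), headers={"User-Agent": USER_AGENT}
    )
```

The request object can then be passed to `urllib.request.urlopen`; no network call happens until that point.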

OA Triage Pattern

When a source is missing, use this order:

  1. arXiv direct PDF (https://arxiv.org/pdf/<id>.pdf).
  2. INSPIRE record/API; follow the legal PDF/URL fields.
  3. KEK records (if reachable from the current network).
  4. Publisher page (only if open/legal access works).
  5. If unavailable, create a sources/pending-<key>.md note with attempted URLs and status.
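
The ordered fallback ending in a pending note can be sketched as a small loop. This is a sketch under assumptions: `triage`, its parameters, and the layout of the pending note are hypothetical, and `fetch` stands in for whatever download-and-convert step is actually used. Only lawful, open-access candidate URLs should be passed in.

```python
from pathlib import Path
from typing import Callable, Iterable, Optional, Tuple


def triage(
    key: str,
    candidates: Iterable[Tuple[str, str]],
    fetch: Callable[[str], bool],
    sources_dir: str = "sources",
) -> Optional[str]:
    """Try (label, url) candidates in order; write a pending note if all fail."""
    attempts = []
    for label, url in candidates:
        try:
            ok = fetch(url)
            status = "ok" if ok else "unavailable"
        except Exception as exc:  # record the failure and keep going
            ok, status = False, f"error: {exc}"
        attempts.append((label, url, status))
        if ok:
            return url  # first working source wins
    # Nothing worked: leave a note listing attempted URLs and their status.
    out = Path(sources_dir)
    out.mkdir(parents=True, exist_ok=True)
    lines = [f"# Pending: {key}", "", "Attempted URLs:", ""]
    lines += [f"- {label}: {url} ({status})" for label, url, status in attempts]
    (out / f"pending-{key}.md").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return None
```

The caller supplies the candidates in priority order, so the function itself stays independent of any particular provider.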

Allowed download domains (agent config): arxiv.org. All other web access is blocked.

Install via CLI
npx skills add https://github.com/arivero/firstProof --skill source-to-markdown
Repository Details
Branch: main
Path: SKILL.md