minerals-web-ingest


Ingest and normalize web pages for critical-minerals intelligence, with optional Firecrawl fetching, deduplication manifest, and JSONL export

By lamm-mit, updated 2/22/2026

name: minerals-web-ingest
description: Ingest and normalize web pages for critical-minerals intelligence, with optional Firecrawl fetching, deduplication manifest, and JSONL export
metadata:
  openclaw:
    emoji: "🧾"
    requires:
      bins:
        - python3
      env:
        - FIRECRAWL_API_KEY


Minerals Web Ingest

Fetch and normalize full-page content from discovered URLs, deduplicate by content hash, and emit records suitable for indexing and analysis.
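To make the content-hash deduplication concrete, here is a minimal sketch. It assumes SHA-256 over whitespace-normalized text and a flat JSON manifest mapping each URL to its last seen hash; the actual hash function and manifest layout used by `web_ingest.py` are not documented here, and the function names (`content_hash`, `load_manifest`, `is_unchanged`, `record_page`, `save_manifest`) are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def content_hash(text: str) -> str:
    """Return a SHA-256 hex digest of whitespace-normalized text (assumed scheme)."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def load_manifest(path: Path) -> dict:
    """Load a URL -> content-hash mapping; a missing file means an empty manifest."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def is_unchanged(manifest: dict, url: str, text: str) -> bool:
    """True if this URL was recorded before with identical (normalized) content."""
    return manifest.get(url) == content_hash(text)


def record_page(manifest: dict, url: str, text: str) -> None:
    """Remember the current content hash for a URL."""
    manifest[url] = content_hash(text)


def save_manifest(path: Path, manifest: dict) -> None:
    """Write the manifest back to disk, creating parent directories if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(manifest, indent=2, sort_keys=True), encoding="utf-8")
```

With a layout like this, a run would skip any page for which `is_unchanged` returns true and call `record_page` only after a page has been processed successfully.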

Usage

# Ingest URLs from monitor output
python3 {baseDir}/scripts/web_ingest.py \
  --input-json monitor_records.json \
  --output-jsonl ingested_records.jsonl \
  --format summary

# Direct URL ingest
python3 {baseDir}/scripts/web_ingest.py \
  --url https://www.energy.gov/articles/example \
  --url https://www.usgs.gov/news/example \
  --format json

# Prefer Firecrawl (if FIRECRAWL_API_KEY is set)
python3 {baseDir}/scripts/web_ingest.py --input-json gov_records.json --prefer-firecrawl

Parameters

| Parameter | Description | Default |
|---|---|---|
| `--input-json` | JSON input file with URL records | - |
| `--url` | Direct URL input (repeatable) | - |
| `--output-jsonl` | Optional JSONL output path | - |
| `--manifest-path` | URL->content hash dedupe manifest | `~/.scienceclaw/minerals_web_ingest_manifest.json` |
| `--timeout` | HTTP timeout in seconds | `30` |
| `--max-chars` | Max characters stored per page | `12000` |
| `--prefer-firecrawl` | Use Firecrawl first, fall back to requests | `false` |
| `--format` | Output format: `summary`, `detailed`, or `json` | `summary` |
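As an illustration of how these flags and defaults could be declared, the following is a sketch using the standard-library `argparse` module. It mirrors the parameters listed above but is not taken from `web_ingest.py`; the integer type for `--timeout` and the `build_parser` name are assumptions.

```python
import argparse
from pathlib import Path

# Default manifest location as documented; "~" is expanded to the home directory.
DEFAULT_MANIFEST = Path.home() / ".scienceclaw" / "minerals_web_ingest_manifest.json"


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented flags (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="Ingest and normalize web pages")
    parser.add_argument("--input-json", help="JSON input file with URL records")
    # action="append" is what makes --url repeatable on the command line.
    parser.add_argument("--url", action="append", default=[],
                        help="Direct URL input (repeatable)")
    parser.add_argument("--output-jsonl", help="Optional JSONL output path")
    parser.add_argument("--manifest-path", default=str(DEFAULT_MANIFEST),
                        help="URL->content hash dedupe manifest")
    parser.add_argument("--timeout", type=int, default=30,
                        help="HTTP timeout in seconds (int is an assumption)")
    parser.add_argument("--max-chars", type=int, default=12000,
                        help="Max characters stored per page")
    parser.add_argument("--prefer-firecrawl", action="store_true",
                        help="Use Firecrawl first, fall back to requests")
    parser.add_argument("--format", choices=["summary", "detailed", "json"],
                        default="summary", help="Output format")
    return parser
```

Parsing `["--url", "https://a.example", "--url", "https://b.example"]` with this parser collects both URLs into `args.url` and leaves every other option at its documented default.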

Output Schema

Ingested records include: url, source, published_at, title, summary, commodity_tags, country_tags, policy_signal, confidence, retrieved_at, source_type, content, content_hash.
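The record shape and JSONL export can be sketched as follows. The field names come from the list above; the field types, the default values, and the `IngestedRecord` and `write_jsonl` names are assumptions made for illustration, since the schema does not state them.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Iterable, List, Optional, TextIO


@dataclass
class IngestedRecord:
    """One ingested page; field types are assumed, names follow the schema."""
    url: str
    source: str = ""
    published_at: Optional[str] = None   # assumed ISO 8601 string when known
    title: str = ""
    summary: str = ""
    commodity_tags: List[str] = field(default_factory=list)
    country_tags: List[str] = field(default_factory=list)
    policy_signal: Optional[str] = None
    confidence: Optional[float] = None
    retrieved_at: str = ""
    source_type: str = ""
    content: str = ""
    content_hash: str = ""


def write_jsonl(records: Iterable[IngestedRecord], stream: TextIO) -> int:
    """Write one JSON object per line and return the number of records written."""
    count = 0
    for record in records:
        stream.write(json.dumps(asdict(record), ensure_ascii=False) + "\n")
        count += 1
    return count
```

JSONL keeps each record on its own line, so downstream indexing jobs can stream the file without loading it whole.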

Notes

  • If Firecrawl is unavailable or fails, the script falls back to requests + BeautifulSoup extraction.
  • Manifest-based dedupe prevents reprocessing unchanged pages.
  • Use this output as input to corpus indexing or profile-generation pipelines.
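The fallback behavior in the first note can be sketched as an ordered list of fetchers. To keep the example runnable without network access or third-party packages, the backends are passed in as plain callables standing in for Firecrawl and for requests + BeautifulSoup; `fetch_with_fallback` and its signature are hypothetical, and the whitespace normalization and truncation step is an assumption about how `--max-chars` is applied.

```python
from typing import Callable, Sequence, Tuple

# A fetcher takes a URL and returns extracted page text, or raises on failure.
Fetcher = Callable[[str], str]


def fetch_with_fallback(
    url: str,
    fetchers: Sequence[Tuple[str, Fetcher]],
    max_chars: int = 12000,
) -> Tuple[str, str]:
    """Try each (name, fetcher) in order; return (backend_name, text) on first success."""
    errors = []
    for name, fetch in fetchers:
        try:
            text = fetch(url)
        except Exception as exc:  # any backend failure triggers the next fetcher
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            # Collapse whitespace, then cap the stored length.
            return name, " ".join(text.split())[:max_chars]
        errors.append(f"{name}: empty response")
    raise RuntimeError(f"all fetchers failed for {url}: " + "; ".join(errors))
```

With `--prefer-firecrawl`, the list would start with the Firecrawl-backed fetcher; without it, only the requests-based fetcher would be supplied. Returning the backend name alongside the text makes it easy to log which path produced each record.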
Install via CLI
npx skills add https://github.com/lamm-mit/scienceclaw --skill minerals-web-ingest
Repository Details
Stars: 222
Forks: 41
Branch: main
Path: SKILL.md