minerals-web-ingest - SKILL.md Agent Skill

name: minerals-web-ingest
description: Ingest and normalize web pages for critical-minerals intelligence, with optional Firecrawl fetching, deduplication manifest, and JSONL export
metadata:
  openclaw:
    emoji: "🧾"
    requires:
      bins:
        - python3
      env:
        - FIRECRAWL_API_KEY

Minerals Web Ingest

Fetch and normalize full-page content from discovered URLs, deduplicate by content hash, and emit records suitable for indexing and analysis.

Usage

# Ingest URLs from monitor output
python3 {baseDir}/scripts/web_ingest.py \
  --input-json monitor_records.json \
  --output-jsonl ingested_records.jsonl \
  --format summary

# Direct URL ingest
python3 {baseDir}/scripts/web_ingest.py \
  --url https://www.energy.gov/articles/example \
  --url https://www.usgs.gov/news/example \
  --format json

# Prefer Firecrawl (if FIRECRAWL_API_KEY is set)
python3 {baseDir}/scripts/web_ingest.py --input-json gov_records.json --prefer-firecrawl

Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--input-json` | JSON input file with URL records | - |
| `--url` | Direct URL input (repeatable) | - |
| `--output-jsonl` | Optional JSONL output path | - |
| `--manifest-path` | Deduplication manifest mapping each URL to its content hash | `~/.scienceclaw/minerals_web_ingest_manifest.json` |
| `--timeout` | HTTP timeout in seconds | 30 |
| `--max-chars` | Maximum characters stored per page | 12000 |
| `--prefer-firecrawl` | Try Firecrawl first, fall back to requests | false |
| `--format` | Output format: summary, detailed, or json | summary |
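
To make the flags and defaults above concrete, here is a minimal sketch of how such a command line could be declared with `argparse`. The function name `build_parser` is hypothetical, and the actual `web_ingest.py` may define its arguments differently; only the flag names and defaults come from the table.

```python
import argparse
from pathlib import Path

# Default taken from the parameter table.
DEFAULT_MANIFEST = "~/.scienceclaw/minerals_web_ingest_manifest.json"


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="web_ingest.py")
    parser.add_argument("--input-json", help="JSON input file with URL records")
    # action="append" makes --url repeatable.
    parser.add_argument("--url", action="append", default=[],
                        help="Direct URL input (repeatable)")
    parser.add_argument("--output-jsonl", help="Optional JSONL output path")
    parser.add_argument("--manifest-path", default=DEFAULT_MANIFEST)
    parser.add_argument("--timeout", type=int, default=30)
    parser.add_argument("--max-chars", type=int, default=12000)
    parser.add_argument("--prefer-firecrawl", action="store_true")
    parser.add_argument("--format", choices=["summary", "detailed", "json"],
                        default="summary")
    return parser


# Parse the "Direct URL ingest" example from the Usage section.
args = build_parser().parse_args([
    "--url", "https://www.energy.gov/articles/example",
    "--url", "https://www.usgs.gov/news/example",
    "--format", "json",
])
# "~" is not expanded by argparse, so expand it explicitly.
manifest_path = Path(args.manifest_path).expanduser()
print(args.url, args.format, args.timeout, args.max_chars)
```
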

Output Schema

Each ingested record includes the following fields: `url`, `source`, `published_at`, `title`, `summary`, `commodity_tags`, `country_tags`, `policy_signal`, `confidence`, `retrieved_at`, `source_type`, `content`, `content_hash`.
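
As an illustration of the schema, the sketch below serializes one record to a JSONL line and checks that every listed field is present. The field names come from the list above. The value types, the sample values, the helper name `to_jsonl_line`, and the use of SHA-256 for `content_hash` are all assumptions made for this example, not confirmed behavior of the script.

```python
import hashlib
import json

# Field names as listed in the output schema.
FIELDS = (
    "url", "source", "published_at", "title", "summary",
    "commodity_tags", "country_tags", "policy_signal", "confidence",
    "retrieved_at", "source_type", "content", "content_hash",
)


def to_jsonl_line(record: dict) -> str:
    """Hypothetical helper: validate field presence, emit one JSON line."""
    missing = [name for name in FIELDS if name not in record]
    if missing:
        raise ValueError(f"record is missing fields: {missing}")
    # Emit fields in schema order; keep non-ASCII text readable.
    return json.dumps({name: record[name] for name in FIELDS},
                      ensure_ascii=False)


content = "Example page text about lithium supply."
record = {
    "url": "https://www.usgs.gov/news/example",
    "source": "usgs.gov",                      # assumed: source label
    "published_at": "2024-01-15",              # assumed: ISO date string
    "title": "Example title",
    "summary": "Example summary.",
    "commodity_tags": ["lithium"],             # assumed: list of strings
    "country_tags": ["US"],                    # assumed: list of strings
    "policy_signal": "neutral",                # assumed: free-form label
    "confidence": 0.5,                         # assumed: float in [0, 1]
    "retrieved_at": "2024-01-16T00:00:00Z",    # assumed: ISO timestamp
    "source_type": "government",               # assumed: category label
    "content": content,
    # Assumed: SHA-256 hex digest of the stored content.
    "content_hash": hashlib.sha256(content.encode("utf-8")).hexdigest(),
}

line = to_jsonl_line(record)
print(line)
```

Writing one such line per record, each terminated by a newline, yields a file that downstream tools can stream without loading the whole corpus.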

Notes

If Firecrawl is unavailable or fails, the script falls back to requests + BeautifulSoup extraction.
Manifest-based deduplication skips pages whose content hash has not changed since the last run.
Use the output as input to corpus-indexing or profile-generation pipelines.
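
The fallback and deduplication behavior described in these notes can be sketched as follows. This is a minimal illustration, not the script's implementation: the fetchers are passed in as plain callables (the two below are stand-ins, not real Firecrawl or requests calls), SHA-256 is an assumed hash, hashing after truncation is an assumed order of operations, and every function name here is hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional

Fetcher = Callable[[str], str]


def fetch_with_fallback(url: str, fetchers: Iterable[Fetcher]) -> str:
    """Try each fetcher in order; return the first non-empty result."""
    errors = []
    for fetch in fetchers:
        try:
            text = fetch(url)
            if text:
                return text
            errors.append(f"{fetch.__name__}: empty response")
        except Exception as exc:  # one failed fetcher should not end the run
            errors.append(f"{fetch.__name__}: {exc}")
    raise RuntimeError(f"all fetchers failed for {url}: {errors}")


def load_manifest(path: Path) -> Dict[str, str]:
    """Load the URL -> content hash mapping, or start empty."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def save_manifest(path: Path, manifest: Dict[str, str]) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")


def ingest(url: str, fetchers: Iterable[Fetcher], manifest: Dict[str, str],
           max_chars: int = 12000) -> Optional[dict]:
    """Return a partial record, or None if the page is unchanged."""
    text = fetch_with_fallback(url, fetchers)[:max_chars]
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()  # assumed hash
    if manifest.get(url) == digest:
        return None  # same content as last run: skip reprocessing
    manifest[url] = digest
    return {"url": url, "content": text, "content_hash": digest}


# Stand-in fetchers for the demo; no network access is performed.
def failing_firecrawl(url: str) -> str:
    raise ConnectionError("Firecrawl unavailable")


def plain_fetch(url: str) -> str:
    return "Example page text about lithium supply."


manifest: Dict[str, str] = {}
url = "https://www.usgs.gov/news/example"
first = ingest(url, [failing_firecrawl, plain_fetch], manifest)
second = ingest(url, [failing_firecrawl, plain_fetch], manifest)
print(first is not None, second is None)
```

In this sketch the first call falls through to the second fetcher and records the hash, and the second call returns `None` because the manifest already holds the same hash for that URL. Persisting the manifest between runs with `save_manifest` is what would make the skip apply across invocations.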