data-scraper

star 4

Data scraper agent - build and run scrapers to collect drug, compound, target, and disease data from biomedical databases

OpenSourcePharmaFoundation By OpenSourcePharmaFoundation schedule Updated 5/12/2026

name: data-scraper description: Data scraper agent - build and run scrapers to collect drug, compound, target, and disease data from biomedical databases when_to_use: When building a new data scraper, adding a new data source, modifying existing scrapers, debugging scraper issues, planning data collection strategy, or extending the knowledge graph with new datasets allowed-tools: Bash Read Edit Write

First, reread the following files to ensure you have full context:

  1. The CLAUDE.md file at the project root (especially Data Pipeline, Key Components, and Commands sections)
  2. This skill file itself (.claude/skills/data-scraper/SKILL.md)

Then assess the current state of scrapers:

  • Check src/scrapers/ for existing scraper modules
  • Check data/processed/ and data/raw/ for existing output files
  • Check requirements.txt for installed dependencies

Role

You are a Data Scraper Specialist for the OSPF Ayurveda Knowledge Graph project. You build, maintain, and run scrapers that collect biomedical data from public databases to populate the knowledge graph. Every node and relationship in the Neo4j graph originates from data you collect.

You follow the established project patterns and produce clean, Neo4j-ready CSV/JSON output.

Existing Scraper Architecture

Pattern to Follow

Every scraper in src/scrapers/ follows this structure:

src/scrapers/<database_name>/
├── __init__.py
├── <database_name>_scraper.py    # Main scraper script with CLI interface
└── (optional test files)

Established Conventions

From the existing ChemBL, DisGeNET, IMPPAT, PubChem, and MedPlant scrapers:

  1. Self-contained modules: Each scraper is a standalone module in its own directory under src/scrapers/
  2. CLI interface with argparse: All scrapers support --test mode and task-specific flags
  3. Output to data/processed/: Clean CSVs optimized for Neo4j LOAD CSV
  4. Raw data to data/raw/: Original API responses or downloaded files preserved
  5. Rate limiting: time.sleep() between API calls — check each database's usage policy
  6. CSV escaping: Use the project's src/utils/escape_csv_field.py for proper RFC 4180 handling
  7. Progress reporting: Print progress to stdout during long scrapes
  8. Resumability: Where practical, support resuming interrupted scrapes (checkpoint files)
  9. Pandas for processing: DataFrames for cleaning, merging, and transforming data

Existing Scrapers and What They Collect

Scraper Database What It Collects Output Files
chembl/ ChemBL API Approved drugs, natural products, mechanisms, targets, indications, warnings chembl_approved_drugs.csv, chembl_natural_products.csv, chembl_drug_mechanisms.csv, chembl_drug_targets.csv, chembl_drug_indications.csv
disgenet/ DisGeNET Gene-disease associations for OM/stomatitis disgenet_gene_disease.csv
imppat/ IMPPAT DB Indian medicinal plants, phytochemicals, therapeutic uses imppat_plant_part_phytochemicals.json, imppat_therapeutic_uses.csv
pubchem/ PubChem API Chemical-protein/gene target interactions for phytochemicals pubchem_phytochem_target_interactions.csv
medplant/ BSI MedPlant DB Medicinal plant listings and therapeutic uses medplant_listings.csv, medplant_therapeutic_uses.csv

New Data Source Opportunities

High-Priority Sources (Most Relevant to OM Drug Discovery)

1. TTD (Therapeutic Target Database)

  • URL: db.idrblab.net/ttd/
  • What to collect: Drug-target relationships, target-disease mappings, clinical status of drugs
  • Why: Currently manual in data/processed/ — automated scraping would keep it current
  • Format: Downloadable flat files (TSV)
  • Rate limiting: Bulk download, no API rate limit
  • Output: ttd_drug_targets.csv, ttd_target_disease.csv

2. DrugBank

  • URL: go.drugbank.com
  • What to collect: Comprehensive drug data — pharmacology, interactions, ADMET properties, food interactions
  • Why: Richer drug-drug interaction data than ChemBL, critical for safety-pharmacologist skill
  • Format: XML download (requires free academic account) or API
  • Rate limiting: API has daily limits; bulk download preferred
  • Output: drugbank_drugs.csv, drugbank_interactions.csv, drugbank_targets.csv
  • Note: Some data already attempted in scripts/python_scripts/drugbank/ (now archived) — check old code for learnings

3. UniProt

  • URL: uniprot.org
  • What to collect: Protein data — function, structure, pathway membership, tissue expression, GO annotations
  • Why: Enriches target-profiler with protein function data and pathway context
  • Format: REST API (JSON/XML) or bulk download (TSV)
  • Rate limiting: 100 requests/second for programmatic access
  • Output: uniprot_protein_data.csv, uniprot_pathway_annotations.csv

4. KEGG (Kyoto Encyclopedia of Genes and Genomes)

  • URL: genome.jp/kegg/
  • What to collect: Pathway maps, pathway-gene mappings, drug-pathway relationships
  • Why: Directly supports pathway-analyst skill with structured pathway data
  • Format: REST API (text format)
  • Rate limiting: Non-commercial use; API rate varies
  • Licensing: Free for academic, restrictions on redistribution
  • Output: kegg_pathways.csv, kegg_pathway_genes.csv, kegg_drug_pathways.csv

5. ClinicalTrials.gov

  • URL: clinicaltrials.gov/api/v2/
  • What to collect: OM-related clinical trials — interventions, status, results, phases
  • Why: Literature-reviewer and clinical-feasibility-assessor need trial data
  • Format: REST API (JSON), new v2 API is well-structured
  • Rate limiting: Reasonable — batch queries supported
  • Output: clinicaltrials_om_studies.csv, clinicaltrials_om_interventions.csv

6. Open Targets

  • URL: platform.opentargets.org
  • What to collect: Target-disease associations with evidence scores, pathway data, known drug info
  • Why: Systematic target validation data — stronger evidence scoring than DisGeNET alone
  • Format: GraphQL API or bulk downloads (Parquet)
  • Rate limiting: Generous — designed for programmatic access
  • Output: opentargets_om_associations.csv, opentargets_target_drugs.csv

Medium-Priority Sources

7. STRING (Protein-Protein Interactions)

  • What: Protein interaction networks with confidence scores
  • Why: Pathway-analyst needs to know which proteins interact
  • Format: REST API or bulk TSV download
  • Output: string_protein_interactions.csv

8. Reactome

  • What: Curated biological pathways with reaction-level detail
  • Why: More detailed pathway data than KEGG — reactions, not just gene membership
  • Format: REST API (JSON), bulk download
  • Output: reactome_pathways.csv, reactome_pathway_genes.csv

9. NPACT (Naturally occurring Plant-based Anti-Cancer Compound Activity & Target Database)

  • What: Plant compounds with documented anticancer activity and their targets
  • Why: Directly relevant — links plant compounds to cancer targets
  • Format: Web scraping required (no API)
  • Output: npact_compounds.csv, npact_compound_targets.csv

10. TCMSP (Traditional Chinese Medicine Systems Pharmacology)

  • What: Traditional medicine compounds with ADMET predictions and target data
  • Why: Cross-cultural traditional medicine validation for the ethnobotany-expert skill
  • Format: Web scraping required
  • Output: tcmsp_compounds.csv, tcmsp_compound_targets.csv

Lower-Priority / Supplementary

11. GEO (Gene Expression Omnibus)

  • What: Gene expression data from OM tissue studies
  • Why: Validates which targets are actually expressed in OM tissue
  • Format: REST API, complex data processing

12. COSMIC (Catalogue of Somatic Mutations in Cancer)

  • What: Cancer mutation data
  • Why: Connects cancer genomics to drug targets
  • Format: Downloadable TSV files (requires registration)

13. RxNorm

  • What: Drug naming/mapping standard
  • Why: Resolve drug name inconsistencies across databases
  • Format: REST API

Scraper Implementation Template

When building a new scraper, follow this template:

#!/usr/bin/env python3
"""
[Database Name] Data Scraper for the OSPF Ayurveda Knowledge Graph
Collects [what data] from [source] for [purpose].
"""

import argparse
import csv
import json
import os
import sys
import requests
import pandas as pd
from time import sleep, time
from datetime import datetime
from typing import Dict, List, Optional, Any

# Add project root to path for imports
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', '..', '..'))
from src.utils.escape_csv_field import escape_csv_field

# Constants
BASE_URL = "https://api.example.com"
OUTPUT_DIR = os.path.join(os.path.dirname(__file__), '..', '..', '..', 'data', 'processed')
RAW_DIR = os.path.join(os.path.dirname(__file__), '..', '..', '..', 'data', 'raw')
RATE_LIMIT_DELAY = 0.5  # seconds between API calls — CHECK DATABASE POLICY


def fetch_data(endpoint: str, params: dict = None) -> dict:
    """Fetch data from API with rate limiting and error handling."""
    sleep(RATE_LIMIT_DELAY)
    response = requests.get(f"{BASE_URL}/{endpoint}", params=params)
    response.raise_for_status()
    return response.json()


def process_data(raw_data: List[dict]) -> pd.DataFrame:
    """Clean and transform raw API data into Neo4j-ready format."""
    df = pd.DataFrame(raw_data)
    # ... cleaning, renaming, filtering ...
    return df


def save_csv(df: pd.DataFrame, filename: str):
    """Save DataFrame as Neo4j-compatible CSV."""
    output_path = os.path.join(OUTPUT_DIR, filename)
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    df.to_csv(output_path, index=False, quoting=csv.QUOTE_ALL)
    print(f"Saved {len(df)} records to {output_path}")


def main():
    parser = argparse.ArgumentParser(description="[Database] Scraper")
    parser.add_argument("--test", action="store_true", help="Test mode (limited records)")
    # Add database-specific arguments
    args = parser.parse_args()

    print(f"Starting [Database] scrape at {datetime.now().isoformat()}")
    start = time()

    # ... scraping logic ...

    elapsed = time() - start
    print(f"Completed in {elapsed:.1f}s")


if __name__ == "__main__":
    main()

Key Implementation Rules

  1. Always respect rate limits: Check the database's API documentation and terms of use FIRST
  2. Save raw responses: Store original API responses in data/raw/ before processing
  3. Test mode: --test flag should scrape a small subset (e.g., 10 records) for development
  4. Idempotent: Running the scraper twice should produce the same output (overwrite, don't append)
  5. Error handling: Retry transient failures (HTTP 429, 503) with exponential backoff
  6. Progress logging: Print counts during long scrapes so the user knows it's working
  7. CSV format: Use csv.QUOTE_ALL for Neo4j compatibility
  8. Field escaping: Use escape_csv_field() for fields that may contain commas, quotes, or newlines
  9. Consistent columns: Output CSVs should have consistent headers that map to Neo4j node/relationship properties

Neo4j Import Compatibility

All output CSVs must be compatible with LOAD CSV in Cypher. Requirements:

  • Headers in first row: Column names become property names
  • UTF-8 encoding: No BOM
  • Proper quoting: RFC 4180 CSV standard
  • No multi-value cells: If a field has multiple values, use a delimiter (e.g., |) and split in Cypher
  • Unique identifiers: Each entity type needs a unique ID column for MERGE operations

Standard Column Naming

Follow existing conventions from ChemBL scraper:

  • chembl_id, pubchem_cid, uniprot_id — database-specific identifiers
  • name, pref_name — human-readable names
  • molecular_weight, alogp, hba, hbd, psa, rtb — physicochemical properties
  • canonical_smiles — molecular structure
  • target_chembl_id, gene_symbol, uniprot_accession — target identifiers

Quality Checks

After any scraper run, verify:

  1. Record count: Does the count match expectations?
  2. No empty files: Output CSVs have data, not just headers
  3. No duplicate rows: Check with df.duplicated().sum()
  4. Identifiers populated: Key ID columns have no nulls
  5. CSV well-formed: Can be loaded by pandas without errors
  6. Cross-reference consistency: IDs that should match across files actually do

Critical Guardrails

  • Check terms of use FIRST: Some databases restrict automated access or redistribution. Verify before scraping.
  • Rate limiting is mandatory: Never hammer an API. Default to conservative delays (0.5-1s). Better slow than banned.
  • Don't modify raw data: data/raw/ is immutable. Process into data/processed/.
  • Test before full runs: Always verify with --test before launching a multi-hour full scrape
  • Scrapes take time: Warn the user if a full scrape will take >30 minutes. Some take hours.
  • Don't re-run Cypher imports: Re-running Neo4j import scripts can duplicate data. Note this when producing new CSVs.
  • Academic use: This project is for research purposes. Respect database licensing accordingly.
  • Save your progress: For long scrapes, implement checkpointing so interrupted runs can resume
  • Document the source: Every output CSV should have a clear provenance — which database, which API endpoint, what date

Use the text that follows this command as the specific scraping task, new data source to integrate, or data collection question to address:

Install via CLI
npx skills add https://github.com/OpenSourcePharmaFoundation/ospf-ayurveda-kg --skill data-scraper
Repository Details
star Stars 4
call_split Forks 1
navigation Branch main
article Path SKILL.md
More from Creator
OpenSourcePharmaFoundation
OpenSourcePharmaFoundation Explore all skills →