debug-crawler - SKILL.md Agent Skill

name: debug-crawler description: Investigate a failing crawler from an issues.json artifact URL and propose a fix. Covers fetching error details, inspecting source data via Zyte, and common failure patterns including sources that are blocked, geo-blocked, 403/429-throttled, or behind a JavaScript challenge or anti-bot protection. argument-hint: "<issues.json URL>" allowed-tools: Read, Edit, Glob, Grep, Bash, WebFetch

Debug a Failing Crawler

The user has provided an issues.json artifact URL: $ARGUMENTS

Read zavod/docs as needed to understand how crawlers are normally written — the goal here is to fix the failing crawler in accordance with existing practices, not to refactor or standardise it.

Step 1: Fetch the issues

Fetch the issues.json URL to understand the error:

WebFetch <issues.json URL>
prompt: "Show all issues, especially errors and warnings. Include full message text and any data fields."

Note the:

Dataset name (e.g. us_ne_med_exclusions)
Error message and traceback
The row data if an assertion failed — the keys are slugified column names, values are cell contents

Step 2: Find the crawler

# Glob datasets/**/<dataset_name>.yml

Read the crawler's .yml and crawler.py.

Step 3: Inspect the current source data

The source has likely changed. Use OPENSANCTIONS_ZYTE_API_KEY (already set in the environment) to fetch via Zyte when direct access times out or is blocked:

python3 -c "
import requests, os
from base64 import b64decode

ZYTE_API_KEY = os.environ['OPENSANCTIONS_ZYTE_API_KEY']
url = '<data_url from .yml>'

resp = requests.post(
    'https://api.zyte.com/v1/extract',
    auth=(ZYTE_API_KEY, ''),
    json={'url': url, 'httpResponseBody': True, 'httpResponseHeaders': True},
    timeout=60
)
resp.raise_for_status()
content = b64decode(resp.json()['httpResponseBody'])
# then parse content as appropriate for the source format
"

Add 'geolocation': 'US' (or the relevant country code) to the Zyte request when the source geo-restricts access — and add the matching geolocation= argument to the fetch_resource / fetch_html call in the crawler.

If the fix is to move the crawler onto Zyte (the source is now blocked, geo-blocked, throttled, or behind a JavaScript challenge), see zavod/docs/best_practices/http_operations.md for choosing the right helper (fetch_html for browser rendering, fetch_text / fetch_json / fetch_resource otherwise) and remember to set ci_test: false on the dataset.

Step 4: Diagnose

Compare what the source actually contains against what the crawler expects.

Common failures

Symptom	Cause	Fix
Expected field/column not found	Source renamed or restructured columns	Update the crawler to match the new structure
First page parses fine, later pages fail	Per-page header handling no longer matches source	Adjust header-reading logic to match current source
403 / empty response from Zyte	Source geo-restricts content	Add `geolocation=` to the fetch call
Assertion on entity count fails	Source grew or shrank	Verify the count is real, then update `assertions:` bounds
Unexpected keys in `audit_data`	New columns added to source	Pop and handle (or explicitly ignore) the new fields

Step 5: Fix and verify

After making code changes, delete the cached source file so the fresh copy is fetched:

rm -f data/datasets/<dataset_name>/source.*

zavod crawl datasets/<path>/<dataset_name>.yml

Check data/datasets/<dataset_name>/issues.log for remaining warnings. Then export and confirm the delta is plausible:

zavod export --rebuild-store datasets/<path>/<dataset_name>.yml

A healthy run shows:

No errors in the crawl log
Delta (added/deleted/modified) consistent with elapsed time since the last run
Entity counts within the assertions: bounds in the .yml