name-framework-migration-first-step

star 748

Migrate ad-hoc name cleaning in a crawler to h.review_names (Step 1 of the name framework migration). Use when a crawler.py contains delimiter splits, regex substitutions, bracket stripping, or conditional logic applied to name strings before the name is added or applied.

opensanctions By opensanctions schedule Updated 6/10/2026

name: name-framework-migration-first-step description: Migrate ad-hoc name cleaning in a crawler to h.review_names (Step 1 of the name framework migration). Use when a crawler.py contains delimiter splits, regex substitutions, bracket stripping, or conditional logic applied to name strings before the name is added or applied. argument-hint: "[crawler.py path]" disable-model-invocation: true

Perform Step 1 of the name framework migration in $ARGUMENTS: introduce h.review_names alongside the existing cleaning logic. Existing entity.add / h.apply_name calls remain in place and continue to drive output; reviews are not applied until Step 3 of the procedure.

Branch setup

Before making any changes:

  1. Derive a branch name from the crawler path by taking the dataset name (the directory containing crawler.py) and prefixing it with name-migration/. For example, datasets/us/ga/med_exclusions/crawler.pyname-migration/us-ga-med-exclusions.
  2. Create and check out the branch:
    git checkout -b <branch-name>
    
  3. Confirm you are on the new branch before proceeding.

Crawler source

!cat $ARGUMENTS

Read first (in order)

  • examples/migrations.md — real before/after migrations; read before touching any crawler
  • zavod/docs/extract/names.md#migrating-to-the-name-cleaning-helpers — full three-step procedure and rationale
  • zavod/zavod/helpers/names.py — exact signatures for review_names, check_names_regularity, Names

Trigger patterns to find

Scan the crawler for any of these before acting:

# Delimiter splits
last_name, name = name_raw.split(",", 1)
name, *aliases = h.multi_split(raw, SPLITS)

# Bracket/parenthesis stripping
name = name.replace("(Acting)", "")
name = name.strip("„")

# Regex substitutions or splits on name content
parts = re.split(r"(?i)\baka\b", name, maxsplit=1)
names = h.multi_split(name_raw, ["(w zapisie także", "(", ")"])

# Conditional checks on name content before apply_name
if len(name_split) > 1:
    entity.add("alias", name_split[1:])

Migration steps

  1. Capture the raw name string before any cleaning: original = h.Names(name=<raw>).
  2. Initialise suggested = h.Names().
  3. For each existing entity.add(name_prop, value) or h.apply_name(...) call, add a mirroring entry to suggested — see examples/migrations.md for the exact patterns.
  4. After all name-setting calls, add:
    is_irregular, suggested = h.check_names_regularity(entity, suggested)
    h.review_names(context, entity, original=original, suggested=suggested, is_irregular=is_irregular)
    
  5. For non-sanctions crawlers, pass llm_cleaning=True and omit suggested and is_irregular:
    h.review_names(context, entity, original=original, llm_cleaning=True)
    

After changes

After every edit to the crawler file, run:

uvx ruff check --fix $ARGUMENTS && uvx ruff format $ARGUMENTS

Fix any errors ruff reports before proceeding.

Once all changes are complete and ruff passes, stage the file:

git add $ARGUMENTS

Then output the suggested commit message (do not commit):

[<dataset_slug>] name migration

where <dataset_slug> is derived from the path by stripping datasets/ and /crawler.py and replacing / with _ (e.g. datasets/us/ga/med_exclusions/crawler.py[us_ga_med_exclusions] name migration).

Do not

  • Do not remove or modify any existing entity.add / h.apply_name calls
  • Do not pass a cleaned or intermediate string as original — always use the unmodified raw source string
  • Do not use llm_cleaning=True for sanctions crawlers
  • Do not proceed to Step 3 of the three-step migration procedure (switching to apply_reviewed_names) — that requires completed reviews first
  • Do not construct Names by guessing field names — read zavod/zavod/helpers/names.py first
  • Do not call h.review_names more than once per entity
  • Do not add explanatory comments beyond what the code requires
Install via CLI
npx skills add https://github.com/opensanctions/opensanctions --skill name-framework-migration-first-step
Repository Details
star Stars 748
call_split Forks 162
navigation Branch main
article Path SKILL.md
More from Creator
opensanctions
opensanctions Explore all skills →