name: typechecker-fixes description: Fix mypy --strict type errors in crawler files. Use when the user asks to make the typechecker happy, fix types, or add type annotations to a crawler. argument-hint: "[crawler.py path]"
Typechecker Fixes for Crawlers
Fix mypy strict-mode errors in opensanctions crawler files. Run mypy --strict --explicit-package-bases on the target directory to find errors, then apply the patterns below.
Read first
zavod/zavod/helpers/html.py— typed HTML helpers (xpath_strings,xpath_elements,xpath_element,element_text)zavod/zavod/util.py—ElementandElementOrTreetype aliases
Workflow
- Run
mypy --strict <crawler.py>to see current errors - Apply fixes using the patterns below
- Run
mypy --strict <crawler.py>again to verify errors resolved
Patterns (most common first)
1. Add -> None return type to all functions that don't return a value
This is the single most common fix. Every crawl(), crawl_row(), crawl_item(), parse_*(), and helper function that doesn't return needs -> None.
# Before
def crawl(context: Context):
def crawl_row(context: Context, row: dict):
def apply_identifier(context: Context, entity: Entity, id_number_line: str):
# After — pick the narrowest value type the row actually contains (see pattern #3)
def crawl(context: Context) -> None:
def crawl_row(context: Context, row: dict[str, str | None]) -> None:
def apply_identifier(context: Context, entity: Entity, id_number_line: str) -> None:
2. Add type annotations to all untyped parameters
# Before
def crawl_item(input_html, context: Context):
def crawl_term(context, link: HtmlElement, ...):
# After
def crawl_item(input_html: Element, context: Context) -> None:
def crawl_term(context: Context, link: HtmlElement, ...) -> None:
3. Replace bare dict with the narrowest value type the code actually uses
Pick the value type by looking at every assignment into the dict. Use Any only as a last resort. In order of preference:
dict[str, str]— all values are strings (e.g.h.xpath_strings(...)[0], string literals,city.get("key", "")).dict[str, str | None]— some values can legitimately beNone(e.g.element.text,element.get("attr"),record.get(key)with no default). At consumer sites that needstr, narrow with a local +assert value is not Noneor relax the consumer's signature to acceptNone.dict[str, Any]— only when values are genuinely heterogeneous (mixed types that can't be expressed as a simple union, e.g.str,int, nestedlist/dict). Prefer aTypedDictif the dict has a fixed schema.
# Before
def crawl_item(input_dict: dict, context: Context):
json_data = { ... }
# After — values are all strings
def crawl_item(input_dict: dict[str, str], context: Context) -> None:
json_data: dict[str, str] = { ... }
# After — values include str | None from element.text
record: dict[str, str | None] = {}
record["name"] = row[0].text # str | None
record["url"] = urljoin(base, row[0].get("href")) # str
# After — truly heterogeneous (resort to Any)
from typing import Any
item: dict[str, Any] = {"name": "x", "count": 3, "tags": [...]}
When you tighten to dict[str, str | None], expect two or three follow-up errors at call sites that expect strict str (e.g. context.fetch_html). Handle each by either:
- Pulling the value into a local and asserting non-None:
url = record["url"]; assert url is not None. - Widening the consumer's signature if
Noneis a semantically valid input (e.g.is_valid(regno: str | None)returningFalseforNone).
Use lowercase dict, list, set, tuple — not the deprecated Dict, List, Set, Tuple from typing. While fixing types, also migrate any existing typing.Dict etc. to builtins.
4. Replace lxml .xpath(), .find() and .findall() calls with typed h.xpath_* helpers
The raw lxml .xpath() returns Any. Use the zavod helpers instead:
# Before — returns Any
links = doc.xpath(".//a/@href")
elements = doc.xpath('.//div[@class="item"]')
text = doc.xpath(".//h1/text()")[0]
# After — properly typed
links = h.xpath_strings(doc, ".//a/@href")
elements = h.xpath_elements(doc, './/div[@class="item"]')
text = h.xpath_string(doc, ".//h1/text()")
When iterating over elements only to extract an attribute (e.g. .get("href")), move the attribute into the xpath and use h.xpath_strings instead:
# Before
for anchor in doc.xpath('//a[contains(@class, "name")]'):
url = anchor.get("href")
crawl_page(context, url)
# Also before (already migrated to xpath_elements but still using .get)
for anchor in h.xpath_elements(doc, '//a[contains(@class, "name")]'):
url = anchor.get("href")
crawl_page(context, url)
# After
for url in h.xpath_strings(doc, '//a[contains(@class, "name")]/@href'):
crawl_page(context, url)
Use h.xpath_element() (singular) when you expect exactly one match:
# Before
divs = doc.xpath(divs_xpath)
assert len(divs) == 1
content = divs[0]
# After
content = h.xpath_element(doc, divs_xpath)
5. Replace .text_content() with h.element_text()
h.element_text() calls text_content() internally and applies collapse_spaces + strip. If the original code was calling squash_spaces or collapse_spaces on the result, that's now redundant and should be removed. If the extracted text is used for lookups (check the lookups: section in the crawler's .yml file) or exact comparisons, pass squash=False to preserve the original whitespace.
Watch out for code that splits the text on \n (or otherwise relies on line breaks) — the default squash=True collapses newlines into spaces, which silently breaks the splitting and merges all rows into one. Always pass squash=False in that case.
# Before
name_info = summary.text.strip()
body = body_els[0].text_content().strip()
category = squash_spaces(row.pop("category").text_content())
# After
name_info = h.element_text(summary)
body = h.element_text(body_els[0])
category = h.element_text(row.pop("category"))
# When exact text matters (used in lookups or comparisons):
label = h.element_text(el, squash=False)
6. Use from zavod.util import Element, ElementOrTree for lxml type annotations
Don't use lxml.etree._Element (private API) or xml.etree.ElementTree. Use the re-exported types:
# Before
from lxml import etree
def parse_record(context: Context, el: etree._Element):
# After
from zavod.util import Element
def parse_record(context: Context, el: Element) -> None:
11. Use Iterator instead of Generator when only yielding
# Before
from typing import Generator
def parse_csv(context: Context, path: str) -> Generator[Item, None, None]:
# After
from typing import Iterator
def parse_csv(context: Context, path: str) -> Iterator[Item]:
12. Add return types to small helper functions
# Before
def clean_address(text):
def extract_passport_no(text):
# After
def clean_address(text: str | None) -> list[str] | None:
def extract_passport_no(text: str | None) -> list[str] | None:
13. Use keyword-only arguments for complex function signatures
When a function has many parameters, add * to force keyword arguments — this catches argument-order bugs at the call site:
# Before
def emit_linked_org(context, vessel_id, names, role, date):
...
emit_linked_org(context, vessel.id, related_ros, "Related Recognised Organization", start_date)
# After
def emit_linked_org(context: Context, *, vessel_id: str | None, names: str, role: str, date: str | None) -> None:
...
emit_linked_org(context, vessel_id=vessel.id, names=related_ros, role="Related Recognised Organization", date=start_date)
Add a comment to functions that return tuples
If the function returns a tuple, add a docstring comment to briefly describe the contents of the tuple. This is not a typechecker fix, but it helps readability since tuples don't have named fields.
General principles
- Never change logic. These are type-annotation-only fixes. Do not change control flow, data transformations, or output. Each fix must be resolved by exactly and narrowly applying one of the rules above. If a type error cannot be resolved that way without altering behavior, leave the error. It is fine to leave some errors unfixed rather than risk changing what the crawler emits.
- Prefer the narrowest correct type. For dicts, follow the preference order in pattern #3:
dict[str, str]>dict[str, str | None]>dict[str, Any]. Only fall back toAnywhen the values are genuinely heterogeneous. - Use
str | Noneunion syntax, notOptional[str], and lowercasedict/list/setnotDict/List/Set— but only when you're already editing the line for another reason. Do not make cosmetic-only changes to lines that have no type errors. - The
context: Contextparameter should always be first in crawler functions.