name: data-join
description: Join two datasets with automatic strategy selection (joinp vs join vs sqlp)
user-invocable: true
argument-hint: " "
allowed-tools: [mcp__qsv__qsv_sniff, mcp__qsv__qsv_count, mcp__qsv__qsv_headers, mcp__qsv__qsv_index, mcp__qsv__qsv_stats, mcp__qsv__qsv_select, mcp__qsv__qsv_sqlp, mcp__qsv__qsv_joinp, mcp__qsv__qsv_command, mcp__qsv__qsv_list_files, mcp__qsv__qsv_search_tools, mcp__qsv__qsv_get_working_dir, mcp__qsv__qsv_set_working_dir]
Data Join
Join two tabular data files on common columns.
Cowork note: If relative paths don't resolve, call
mcp__qsv__qsv_get_working_dirandmcp__qsv__qsv_set_working_dirto sync the working directory.
Strategy Selection
| Scenario | Best Tool | Why |
|---|---|---|
| Standard equi-join | mcp__qsv__qsv_joinp |
Polars engine, fastest |
| Non-equi join (>, <, BETWEEN) | mcp__qsv__qsv_sqlp |
SQL supports complex conditions |
| Cross join / cartesian | mcp__qsv__qsv_sqlp |
CROSS JOIN syntax |
| Memory-constrained | mcp__qsv__qsv_command with command: "join" |
Streaming, lower memory |
| Fuzzy/approximate match | mcp__qsv__qsv_joinp with asof: true |
Nearest-match join |
Steps
Index both files: Run
mcp__qsv__qsv_indexon both files for fast random access.Inspect both files: Run
mcp__qsv__qsv_headerson both files to identify column names. Determine which columns to join on.Profile join columns: Run
mcp__qsv__qsv_statswithcardinality: true, stats_jsonl: trueon both files. Check the cardinality of join columns to determine optimal table order.Choose strategy:
- If cardinality of join column in file1 > file2, put file1 on the left
- For
joinp: smaller cardinality table should be on the right for best performance - If join condition is complex (non-equi), use
mcp__qsv__qsv_sqlp - If join involves date/time matching where exact dates won't align (e.g., quarterly to monthly, event dates to nearest reporting period), use
mcp__qsv__qsv_joinpwithasof: true
Execute join: Use
mcp__qsv__qsv_joinpfor standard joins:joinp columns1: "id" input1: "file1.csv" columns2: "id" input2: "file2.csv" # Join type: omit for inner (default), or set one of: # left: true, full: true, cross: trueOr use
mcp__qsv__qsv_sqlpfor complex joins:SELECT a.*, b.col1, b.col2 FROM file1 a JOIN file2 b ON a.id = b.id AND a.date BETWEEN b.start_date AND b.end_dateFor ASOF (nearest-match) joins, use
mcp__qsv__qsv_joinpwithasof: true:joinp columns1: "date" input1: "events.csv" columns2: "date" input2: "reference.csv" asof: true strategy: "backward" allow_exact_matches: truestrategy: "backward"(default) — match to the last right row with key < left keystrategy: "forward"— match to the first right row with key > left keystrategy: "nearest"— match to the numerically closest row (supportstoleranceparameter)- Add
left_by/right_byparameters to restrict matching within subgroups (e.g., per jurisdiction) - Add
allow_exact_matches: trueto include equal keys (<=, >=); default is strict inequality (<, >)
Clean up result: Use
mcp__qsv__qsv_selectto remove duplicate join columns or unnecessary columns from the result.Verify: Run
mcp__qsv__qsv_counton the result. Compare with input counts to validate join behavior:- Inner join: result <= min(left, right)
- Left join: result >= left count
- Full outer: result >= max(left, right)
- ASOF: result = left count (every left row gets a match or null, like a left join)
Join Column Validation Checklist
Before executing a join, read .stats.csv for both files and validate:
| Check | Stats Column | Red Flag | Action |
|---|---|---|---|
| Type match | type |
Join columns have different types (e.g., Integer vs String) | Cast one column before joining: sqlp with CAST(col AS INTEGER) |
| Null density | nullcount, sparsity |
sparsity > 0.3 on join column | Nulls don't match — expect unmatched rows; consider filtering nulls first |
| Value overlap | min, max |
Non-overlapping ranges across files | No rows will match — verify correct join column |
| Skew detection | mode, mode_count |
One value dominates (mode_count > 50% of rows) | Join will be heavily skewed many-to-one; verify this is expected |
| Uniqueness | uniqueness_ratio |
Both files have uniqueness_ratio < 1.0 on join column | Many-to-many join risk — expect row explosion; verify with mcp__qsv__qsv_count after |
| Outlier keys | outliers_percentage |
outliers_percentage > 5% on numeric join column | Outlier keys may not match across files; consider trimming first |
Join Types
| Type | joinp Flag |
SQL | Behavior |
|---|---|---|---|
| Inner | (default) | JOIN |
Only matching rows |
| Left | --left |
LEFT JOIN |
All left + matching right |
| Full outer | --full |
FULL OUTER JOIN |
All rows from both |
| Cross | --cross |
CROSS JOIN |
Cartesian product |
| Left Anti | --left-anti |
NOT IN / NOT EXISTS |
Left rows without match |
| Left Semi | --left-semi |
EXISTS |
Left rows with match (no right cols) |
| ASOF | --asof |
(use joinp) | Nearest-key match (temporal/numeric) |
Notes
joinpuses the Polars engine and is significantly faster thanjoinfor large files- The stats cache helps
joinpoptimize join execution - For joining on multiple columns, separate column names with commas:
columns1: "col1,col2" - Column names must match exactly (case-sensitive)
- If join columns have different names, specify separately:
columns1: "id",columns2: "customer_id" - For one-to-many joins, the result will have more rows than either input
joinphandles null values in join columns (nulls don't match by default)- ASOF joins implicitly enable
--try-parsedates— no need to pass it explicitly - For ASOF joins with subgroups, use
--left_byand--right_by(e.g., match nearest date per jurisdiction) - The
--toleranceoption (nearest strategy only) limits how far the nearest match can be: use duration strings for dates (1d,30d,365d) or positive integers for numeric keys - ASOF joins require sorted join columns; both datasets are auto-sorted unless
--no-sortis set