name: dct-diff
description: Use this skill when the user wants to compare two data files, find differences between datasets, validate data consistency, check if files have matching records, or reconcile data between sources. Triggers include "compare these files", "diff the datasets", "are these the same", "find differences", "validate data matches", "reconcile", "data comparison", or when doing data quality validation between two files.
DCT Diff - Compare Datasets
Compare two data files with key matching and optional aggregation metrics.
When to Use
Use this skill when you need to:
- Validate data consistency between two versions
- Compare production vs test data
- Reconcile data after ETL processes
- Check for data drift over time
- Validate data migrations
Installation
which dct || { go build -o dct && chmod +x ./dct; }
Usage
dct diff <keys> <file1> <file2> [flags]
Arguments
- keys: Key column(s) for matching records. Formats:
  - Single key: id
  - Composite keys: key1,key2
  - Different names: left_col=right_col
- file1: First data file (left side)
- file2: Second data file (right side)
Flags
- -m, --metrics <spec>: Metrics specification (JSON string or file path)
- -a, --all: Show all metrics columns
- -o, --output <file>: Output to file instead of stdout
Examples
Basic Comparison
Compare by single key:
dct diff id left.csv right.csv
Compare by composite keys:
dct diff "first_name,last_name" file1.parquet file2.parquet
Key Name Mapping
When key columns have different names:
dct diff user_id=customer_id old.csv new.csv
With Metrics
Compare with count distinct metric:
dct diff id left.csv right.csv -m '[{"agg":"count_distinct","left":"email","right":"email"}]'
Multiple metrics:
dct diff id left.csv right.csv -m '[{"agg":"mean","left":"amount","right":"amount"},{"agg":"count_distinct","left":"category","right":"category"}]'
Load metrics from file:
dct diff id left.csv right.csv -m metrics.json -a
Metrics Specification
JSON array of metric objects:
[
  {
    "agg": "count_distinct",
    "left": "column_name",
    "right": "column_name"
  }
]
Available Aggregations
- mean: Average value
- median: Median value
- min: Minimum value
- max: Maximum value
- sum: Sum of values
- count: Count of records
- count_distinct: Count of unique values
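As a minimal sketch of the file form accepted by -m, the snippet below writes a two-metric specification to metrics.json. The file name and the amount and category columns are taken from the examples above; substitute the columns present in your own files.

```shell
# Write a metrics specification to metrics.json.
# "amount" and "category" are example column names, not required ones.
cat > metrics.json <<'EOF'
[
  {"agg": "mean", "left": "amount", "right": "amount"},
  {"agg": "count_distinct", "left": "category", "right": "category"}
]
EOF
```

The file can then be passed as shown earlier: dct diff id left.csv right.csv -m metrics.json -a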
Output Columns
Default output includes:
- Key column(s)
- l_cnt: Count from left file
- r_cnt: Count from right file
- cnt_eq: Whether counts match
With metrics and -a flag:
- l_<col>_<agg>: Left aggregation
- r_<col>_<agg>: Right aggregation
- <col>_<agg>_eq: Whether aggregations match
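Once results are saved with -o, the cnt_eq column can be used to narrow the output to keys whose counts differ. The helper below is a sketch under assumptions not confirmed by this document: that the saved output is comma-separated with a header row, contains no quoted commas, and writes matching rows as true in cnt_eq. The function name mismatched_rows is hypothetical.

```shell
# mismatched_rows FILE: print the header plus every row whose cnt_eq is not "true".
# Assumption: simple comma-separated output with a header and no quoted commas.
mismatched_rows() {
  awk -F',' '
    NR == 1 {
      # Locate the cnt_eq column by name rather than by position.
      for (i = 1; i <= NF; i++) if ($i == "cnt_eq") col = i
      if (!col) { print "cnt_eq column not found" > "/dev/stderr"; exit 1 }
      print
      next
    }
    tolower($col) != "true" { print }
  ' "$1"
}

# Usage (after saving results with -o comparison.csv):
#   mismatched_rows comparison.csv
```

If your output encodes the comparison differently (for example 1/0), adjust the final condition accordingly.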
Best Practices
- Use the -a flag to see all comparison metrics
- Both files must contain the key columns
- Files must have at least one row of data
- Start with a small sample to verify keys work
- Use composite keys when single keys aren't unique
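To decide whether a single key is unique enough, it can help to list duplicated key values before running the diff. The following is a rough sketch for plain CSV input only (no quoted commas, not Parquet); dup_keys is a hypothetical helper and not part of dct.

```shell
# dup_keys FILE KEY: print each value of column KEY that appears more than once.
# Assumption: simple comma-separated file with a header row.
dup_keys() {
  awk -F',' -v key="$2" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == key) col = i; next }
    # Print a value once, on its second occurrence.
    col && seen[$col]++ == 1 { print $col }
  ' "$1"
}

# Usage:
#   dup_keys left.csv id
```

Empty output suggests the column is unique in that file; any printed values suggest a composite key may be needed.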
Error Handling
Common issues:
- attempted to diff when least one of the files have no data: Check files aren't empty
- Key not found: Verify column names match exactly (case-sensitive)
- Format errors: Ensure metrics JSON is valid
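The first two issues can be caught before calling dct diff. The check below is a sketch for plain CSV files only, assuming a comma-separated header on the first line with no quoting; preflight is a hypothetical helper name.

```shell
# preflight FILE KEY: succeed only if FILE has a header containing KEY
# (exact, case-sensitive match) and at least one data row.
preflight() {
  file=$1
  key=$2
  [ -s "$file" ] || { echo "$file: missing or empty" >&2; return 1; }
  head -n 1 "$file" | tr ',' '\n' | grep -qxF -- "$key" \
    || { echo "$file: key column '$key' not found" >&2; return 1; }
  # awk counts a final line even without a trailing newline.
  [ "$(awk 'END { print NR }' "$file")" -ge 2 ] \
    || { echo "$file: no data rows" >&2; return 1; }
}

# Usage:
#   preflight left.csv id && preflight right.csv id && dct diff id left.csv right.csv
```

When the key columns are named differently on each side, check each file against its own column name.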
Example Workflow
# 1. Preview both files first
dct peek left.csv -n 3
dct peek right.csv -n 3
# 2. Compare by ID
dct diff id left.csv right.csv -a
# 3. Save results
dct diff id left.csv right.csv -m metrics.json -a -o comparison.csv
Related Skills
- dct-peek: Preview files before comparing
- dct-profile: Check data quality of each file