data-quality

star 3.7k

Quality dimensions quick reference and remediation decision tree for tabular data assessment

dathere By dathere schedule Updated 4/12/2026

name: data-quality description: Quality dimensions quick reference and remediation decision tree for tabular data assessment

Data Quality Assessment with qsv

For the full step-by-step profiling workflow, use the /data-profile command. This skill provides quick-reference guidance for quality assessment and remediation decisions.

Quality Dimensions (Quick Reference)

Dimension Key Question Primary Check Red Flag
Completeness Missing values? statsnullcount, sparsity Sparsity > 0.5
Uniqueness Unwanted duplicates? stats --cardinality vs row count Key column cardinality < row count
Validity Correct formats/types? statstype; validate schema.json String type on numeric column
Consistency Uniform formats? frequency — case variants; sniff — encoding Same value in different cases
Accuracy Plausible values? stats — min/max/stddev Values > 3 stddev from mean
Column Name Quality Headers safe & descriptive? safenames --verify Spaces, special chars, or duplicates in headers
Conformity Values follow standards? searchset with domain regex Non-standard codes (country, state, zip, phone)
Referential Integrity Foreign keys valid? joinp --left-anti Orphaned references across related files
Injection Safety Malicious payloads? searchset with injection regex Formula/SQL injection patterns in cells
Documentation Dataset described? describegpt --all No Data Dictionary or Description

Remediation Decision Tree

When a quality issue is found, choose the right fix:

Problem Severity Fix Command When to Skip
Ragged rows High fixlengths Never — breaks downstream tools
Wrong encoding High input File is already UTF-8 (check with sniff)
Unsafe column names Medium safenames Headers already safe (no spaces/special chars)
Leading/trailing whitespace Medium sqlp with TRIM(col) Stats show no difference between min/max lengths and trimmed values
Duplicate rows Medium dedup (or extdedup for >1GB) stats --cardinality on key columns shows all unique
Inconsistent case Low sqlp with UPPER(col) or LOWER(col) frequency shows no case variants
Empty values Low sqlp with COALESCE(NULLIF(col, ''), 'N/A') Nulls are semantically meaningful
Non-conforming values Medium searchset + search --flag No domain standard applies
Orphaned foreign keys Medium joinp --left-anti Single-file dataset with no references
Injection payloads High searchset with injection regex + sanitize Data is internal-only and never opened in spreadsheets or loaded into databases
Invalid rows Low validate schema.json + filter No schema available

Fix Ordering

Always apply fixes in this order to avoid cascading issues:

1. input          (encoding — must be UTF-8 before anything else)
2. safenames      (headers — fixes names before column references)
3. fixlengths     (structure — ensures consistent field counts)
4. sqlp with TRIM()    (whitespace — clean values before dedup)
5. dedup          (duplicates — remove after trimming so "foo " and "foo" match)
6. validate       (validation — check against schema last)

Stats Cache as Quality Dashboard

After running stats --cardinality --stats-jsonl (basic moarstats auto-runs), read the .stats.csv cache to assess quality in one pass:

Cache Column Quality Signal
nullcount Completeness — 0 is ideal
sparsity Completeness — ratio of nulls (0.0–1.0)
cardinality Uniqueness — compare to row count
type Validity — check expected types
min / max Accuracy — plausible range?
mean / stddev Accuracy — outlier detection (>3σ)
outliers_total_cnt Accuracy — from moarstats; outlier count per column
mode Consistency — dominant value expected?

Advanced Stats (via moarstats --advanced)

Run moarstats --advanced to enrich the cache with distribution shape metrics:

Cache Column Quality Signal
kurtosis >3 heavy tails (outlier-prone), <3 light tails; >10 = extreme outliers
bimodality_coefficient >=0.555 suggests bimodal distribution (possible mixed populations)
jarque_bera_pvalue <0.05 = NOT normally distributed; flag analyses assuming normality
gini_coefficient Near 1 = extreme concentration; near 0 = uniform
shannon_entropy Low = concentrated values; high = diverse
winsorized_mean Compare to mean — large difference signals outlier influence
median_mean_ratio <0.8 or >1.2 = significantly skewed; mean may be misleading
range_stddev_ratio Very high = extreme outliers relative to variability
cv >100% = high relative variability; data is highly spread relative to mean
mad_stddev_ratio >0.8 = stddev is reliable; <<0.8 = outliers inflating stddev
mode_zscore Far from 0 = mode is atypical; possible mixed populations
trimean Robust central tendency: (Q1 + 2*median + Q3)/4; compare to mean for skew detection
midhinge Midpoint of middle 50%: (Q1+Q3)/2; robust center measure
robust_cv MAD/abs(median); outlier-resistant coefficient of variation
theil_index Inequality measure (0=equal); decomposable into within/between group; only for positive values
mean_ad Average absolute distance from mean; less sensitive to outliers than stddev
simpsons_diversity_index Probability two random values differ (0-1); more intuitive than entropy
Install via CLI
npx skills add https://github.com/dathere/qsv --skill data-quality
Repository Details
star Stars 3,678
call_split Forks 103
navigation Branch main
article Path SKILL.md
More from Creator