qsv-performance

star 3.7k

Performance guide covering index files, stats cache, and frequency cache accelerators for qsv

dathere By dathere schedule Updated 4/14/2026

name: qsv-performance description: Performance guide covering index files, stats cache, and frequency cache accelerators for qsv

qsv Performance Guide

Three Accelerators

1. Index Files (.csv.idx)

Created by: qsv index Used by: count, slice, sample, split, stats, frequency, schema, and others marked with ๐Ÿ“‡

Benefit Without Index With Index
Row count Scan entire file Instant (stored in index)
Random access Sequential scan O(1) lookup
Multithreaded Not possible Enabled for many commands
Slicing Read from start Jump to position

Rule: Always run index first if you'll run 2+ commands on the same file.

Auto-indexing: The MCP server auto-indexes files > 10MB.

2. Stats Cache (.stats.csv + .stats.csv.data.jsonl)

Created by: qsv stats --cardinality --stats-jsonl Used by: frequency, schema, tojsonl, sqlp, joinp, pivotp, diff, sample (smart commands)

Smart Command What It Uses from Cache
frequency Cardinality to skip all-unique columns
schema Data types for JSON Schema generation
sqlp Column types for Polars optimization
joinp Cardinality for optimal join order
pivotp Cardinality to estimate output width
diff Column types for comparison

Rule: Run stats --cardinality --stats-jsonl before using any smart command.

Auto-caching: The MCP server auto-adds --stats-jsonl to stats commands.

3. Polars Engine

Commands: sqlp, joinp, pivotp, count (with --polars-len), schema (with --polars)

Benefit Standard (csv crate) Polars Engine
Processing model Row-by-row streaming Vectorized columnar
Memory Streaming (constant) Columnar (efficient)
Parallelism Single-threaded Multi-threaded
Large files Limited by memory Larger-than-memory
SQL support N/A Full SQL dialect

Rule: Use Polars commands (sqlp, joinp, pivotp) for files > 100MB or complex queries.

Parquet Acceleration

For repeated SQL queries on large CSV (> 10MB), consider converting to Parquet with mcp__qsv__qsv_to_parquet. Parquet is a columnar format that speeds up repeated SQL queries in mcp__qsv__qsv_sqlp. Use read_parquet('file.parquet') as the table source. DuckDB is the preferred engine for Parquet queries; mcp__qsv__qsv_sqlp with SKIP_INPUT as the input_file value also works. Note: mcp__qsv__qsv_sqlp can query CSV of any size directly โ€” Parquet is an optimization for repeated queries, not a requirement. Parquet works ONLY with mcp__qsv__qsv_sqlp and DuckDB โ€” all other qsv commands require CSV/TSV/SSV input.

Memory-Aware Command Selection

Commands That Load Entire File into Memory (๐Ÿคฏ)

dedup, reverse, sort, stats (with extended stats), table, transpose

Commands with Memory Proportional to Cardinality (๐Ÿ˜ฃ)

frequency, join, schema, tojsonl

Streaming Commands (constant memory)

Everything else - select, search, slice, replace, count, etc.

Large File Decision Tree

File size?
โ”œโ”€โ”€ < 10MB: Any command works fine
โ”œโ”€โ”€ 10MB - 100MB:
โ”‚   โ”œโ”€โ”€ Always: index first
โ”‚   โ”œโ”€โ”€ Repeated SQL: consider Parquet with qsv_to_parquet
โ”‚   โ”œโ”€โ”€ Prefer: streaming commands
โ”‚   โ””โ”€โ”€ OK: memory-intensive if < available RAM
โ”œโ”€โ”€ 100MB - 1GB:
โ”‚   โ”œโ”€โ”€ Always: index + stats cache first
โ”‚   โ”œโ”€โ”€ Repeated SQL: consider Parquet with qsv_to_parquet
โ”‚   โ”œโ”€โ”€ Prefer: Polars commands (sqlp, joinp, pivotp)
โ”‚   โ”œโ”€โ”€ Avoid: sort, reverse, table (load entire file)
โ”‚   โ””โ”€โ”€ Alternative: sqlp with ORDER BY LIMIT instead of sort
โ””โ”€โ”€ > 1GB:
    โ”œโ”€โ”€ Must: index + stats cache
    โ”œโ”€โ”€ Repeated SQL: convert to Parquet with qsv_to_parquet
    โ”œโ”€โ”€ Must: Polars commands only for joins/queries
    โ”œโ”€โ”€ Avoid: all ๐Ÿคฏ commands
    โ””โ”€โ”€ Consider: split into chunks, process, cat rows

Performance Tips

Tip Why
Use --output file.csv Avoids stdout buffering overhead
Use count before stats Fast row count for progress bars
Use select early in pipeline Reduce columns = faster processing
Use --no-headers only when needed Header detection is cheap
Use slice --len N for previews Don't read entire file to inspect
Prefer joinp over join Polars engine is significantly faster
Use frequency --limit N Don't compute all unique values
Use stats --cardinality Enables smart optimizations downstream

Concurrent Operations

The MCP server limits concurrent qsv operations (default: 1). For multiple independent files, the agent can issue separate tool calls.

Timeout Handling

  • Default timeout: 10 minutes (QSV_MCP_OPERATION_TIMEOUT_MS)
  • Long operations (sort on huge files) may timeout
  • If timeout occurs: try Polars alternative or split the file
  • Exit code 124 indicates timeout
Install via CLI
npx skills add https://github.com/dathere/qsv --skill qsv-performance
Repository Details
star Stars 3,678
call_split Forks 103
navigation Branch main
article Path SKILL.md
More from Creator