name: vector-bench
description: Benchmark a vector dataset against OpenData Vector. Handles three modes, namely a built-in dataset (sift, cohere, deep, …), a bench-format custom dataset (fvecs/bvecs/parquet you already have), or raw embeddings (npy / single parquet) that need to be converted first. Generates the config, generates ground truth if missing, runs the bench, and writes a summary.
vector-bench
You are running the OpenData Vector benchmark for the user. Be procedural, write artifacts to disk so the user can re-run without you, and only ask the user when there's genuine ambiguity.
Output layout
Every invocation writes to bench-runs/<run-name>/ at the repo root:
bench-runs/<run-name>/
├── bench.toml # the config you wrote (always present)
├── convert.py # the conversion script (Mode C only; copied from templates/)
├── data/ # converted files (Mode C only)
│ ├── base.parquet
│ └── query.parquet
└── summary.md # final results — always write this, even on failure
Pick <run-name> from the user's prompt (e.g. my-cohere-subset), or default
to a short slug of the input filename.
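A minimal sketch of the slug fallback, assuming "short slug" means lowercase alphanumerics joined by hyphens. The helper name `slugify_run_name` and the 40-character cap are illustrative choices, not part of the bench:

```python
import re
from pathlib import Path


def slugify_run_name(input_path: str, max_len: int = 40) -> str:
    """Derive a short, filesystem-safe run name from an input filename."""
    # Path.stem drops only the final extension ("a.b.npy" -> "a.b").
    stem = Path(input_path).stem.lower()
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", stem).strip("-")
    # Fall back to a fixed name if nothing usable is left.
    return slug[:max_len].rstrip("-") or "run"
```

For example, `slugify_run_name("/data/My Cohere_Subset.v2.npy")` yields `my-cohere-subset-v2`, so the run would land in bench-runs/my-cohere-subset-v2/.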
Decide the mode
Inspect the user's prompt and any path they gave you:
| Trigger | Mode |
|---|---|
| Built-in name (sift1m, sift100k, cohere1m, cohere_wiki_10m, deep10m, deep1b, wikipedia_bge_m3_en, sift10m, sift50m, sift100m, sift1b) | A: built-in |
| Path to an existing dataset with base.* + query.* + ground-truth files already in fvecs/bvecs/parquet | B: bench-format |
| Path to a single .npy, or a single .parquet with embeddings but no separate query file | C: convert |
Use Bash (ls, file) and Read to confirm. If still ambiguous, ask one
question; don't guess.
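The table can be read as a small decision function. The sketch below is one possible reading, not the skill's actual logic: `decide_mode` is a hypothetical helper, and it treats a directory holding base.* and query.* files as Mode B even when ground truth is missing, since Mode B generates it later. Returning `None` stands for "ask the user one question".

```python
from pathlib import Path
from typing import Optional

BUILT_INS = {
    "sift1m", "sift100k", "cohere1m", "cohere_wiki_10m", "deep10m", "deep1b",
    "wikipedia_bge_m3_en", "sift10m", "sift50m", "sift100m", "sift1b",
}
BENCH_EXTS = {".fvecs", ".bvecs", ".parquet"}


def decide_mode(target: str) -> Optional[str]:
    """Return "A", "B", "C", or None when the input is still ambiguous."""
    if target in BUILT_INS:
        return "A"
    path = Path(target)
    if path.is_file():
        # A lone .npy or .parquet needs conversion and/or a query split.
        return "C" if path.suffix in {".npy", ".parquet"} else None
    if path.is_dir():
        names = [p.name for p in path.iterdir() if p.suffix in BENCH_EXTS]
        has_base = any(n.startswith("base.") for n in names)
        has_query = any(n.startswith("query.") for n in names)
        return "B" if has_base and has_query else None
    return None  # not a built-in name and not an existing path
```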
Mode A: built-in
- If the dataset is sift100k and vector/tests/data/sift100k/base.fvecs doesn't exist, run tar xzf vector/tests/data/sift100k/sift100k.tgz -C vector/tests/data/sift100k/. The bundled tarball is the source of truth.
- For any other built-in, check whether the expected files are present under vector/bench/data/ (or $DATA_ROOT). If missing, do not silently download: point the user at vector/bench/README.md#datasets and ask if they want you to run the download snippet.
- Prompt the user about storage (see Storage and disk-space planning). This is not optional and not conditional on dataset size; always ask before writing the config.
- Write a minimal config (see the built-in form under Schema: bench.toml).
- Skip to Run.
Mode B: bench-format custom
- Inspect the files the user pointed at. For .parquet, use the pyarrow inspection snippet under Inspecting parquet to identify the embedding column, candidate id column, and candidate metadata columns. For .fvecs/.bvecs, read the dimension from the 4-byte prefix of the first record.
- Confirm distance metric. If the user didn't say, ask: l2 / cosine / dot_product.
- Prompt the user about storage (see Storage and disk-space planning). This is not optional and not conditional on dataset size; always ask before writing the config.
- Write the full config (see the custom form under Schema: bench.toml).
- If ground_truth_file doesn't exist, run gen_groundtruth before the main bench.
- Run.
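Reading the dimension prefix can be sketched as follows, assuming the common TEXMEX layout in which every record starts with a little-endian int32 dimension followed by that many components (4-byte floats for fvecs, single bytes for bvecs). The helper names are illustrative:

```python
import os
import struct


def read_vecs_dim(path: str) -> int:
    """Read the dimension from the first record of an .fvecs/.bvecs file."""
    with open(path, "rb") as f:
        prefix = f.read(4)
    if len(prefix) != 4:
        raise ValueError(f"{path}: too short to hold a dimension prefix")
    (dim,) = struct.unpack("<i", prefix)  # little-endian int32
    if dim <= 0:
        raise ValueError(f"{path}: implausible dimension {dim}")
    return dim


def count_vecs_records(path: str, component_bytes: int) -> int:
    """Count records; component_bytes is 4 for fvecs and 1 for bvecs."""
    record_bytes = 4 + read_vecs_dim(path) * component_bytes
    size = os.path.getsize(path)
    if size % record_bytes != 0:
        raise ValueError(f"{path}: size is not a whole number of records")
    return size // record_bytes
```

The record count doubles as a sanity check: a file size that is not a multiple of the record size usually means the wrong format was assumed.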
Mode C: convert
- .npy input: copy vector/bench/claude/skills/vector-bench/templates/convert_npy.py to bench-runs/<run-name>/convert.py, then run:

      python3 bench-runs/<run-name>/convert.py \
        --input <user-input.npy> \
        --output bench-runs/<run-name>/data/base.parquet \
        --vector-column emb

  You now have a base file. There's no separate query file, so split off the last 1000 rows as queries using parquet_split_queries.py:

      cp vector/bench/claude/skills/vector-bench/templates/parquet_split_queries.py \
        bench-runs/<run-name>/split_queries.py
      python3 bench-runs/<run-name>/split_queries.py \
        --input bench-runs/<run-name>/data/base.parquet \
        --base bench-runs/<run-name>/data/base.parquet \
        --query bench-runs/<run-name>/data/query.parquet \
        --num-queries 1000

  (Yes, --input and --base point at the same file; the script reads the whole table before writing.)

  Single .parquet (no separate query file): just use parquet_split_queries.py directly, with the user's parquet as --input. Skip the npy conversion.
- Confirm distance metric (ask if unclear).
- Prompt the user about storage (see Storage and disk-space planning). This is not optional and not conditional on dataset size; always ask before writing the config.
- Write the config in custom parquet form pointing at the new data/base.parquet, data/query.parquet, and data/gt.parquet. Do not create gt.parquet yet; the next step generates it.
- Run gen_groundtruth to create data/gt.parquet.
- Run.
Note: describe the split heuristic ("last 1000 rows are queries") in the summary.md so the user can see what you did.
Schema: bench.toml
Built-in form (Mode A)
[data.storage]
type = "SlateDb"
path = "<run-name>"
[data.storage.object_store]
type = "Local"
path = "/tmp/<run-name>"
[[params.recall]]
dataset = "<built-in-name>"
# All other parameters fall back to the built-in's defaults.
Custom form (Modes B and C)
[data.storage]
type = "SlateDb"
path = "<run-name>"
[data.storage.object_store]
type = "Local"
path = "/tmp/<run-name>"
[[params.recall]]
dataset = "<run-name>" # any unique string that isn't a built-in
dimensions = "<u16>"
distance_metric = "<l2|cosine|dot_product>"
format = "<fvecs|bvecs|parquet>"
base_file = "<absolute path>"
query_file = "<absolute path>"
ground_truth_file = "<absolute path>"
# Parquet-only (omit entirely if format is fvecs/bvecs):
vector_column = "<embedding column name>"
ground_truth_column = "<neighbour-list column name in GT file>"
# id_column = "<base id column>" # only if GT references ids, not row indices
# Filter-only (omit if you're not benchmarking filtered queries):
# metadata_columns = "category,brand" # comma-sep base columns ingested as attrs
# indexed_columns = "category,brand" # subset indexed for filtering (omit to index all)
# filter_spec = "/abs/path/to/filter.json"
# Tuning (sensible defaults — uncomment + edit only if the user asks):
nprobe = "100"
num_queries = "1000"
phases = "INGEST,COLD,WARM"
Defaults to pick when the user doesn't specify
- nprobe = 100.
- num_queries = min(1000, queries_available).
- split_threshold / merge_threshold: omit (the dataset's built-in default applies). For custom datasets without a built-in default, the bench uses 1500/500, which is right for ≥1M-vector datasets. For smaller datasets (<1M), explicitly set split_threshold = "150" and merge_threshold = "50".
- block_cache_bytes: omit. The bench picks per-phase defaults (25% of system memory for ingest, ~67% for warm/cold).
- Phases: INGEST,COLD,WARM (the default).
- Storage: see Storage and disk-space planning. Always prompt the user; never silently default to local, whatever the dataset size.
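For a custom dataset, the defaults above reduce to a few lines. This is a sketch only: `pick_custom_defaults` is a hypothetical helper, and it emits string values because the schema in this document quotes every parameter.

```python
def pick_custom_defaults(num_vectors: int, queries_available: int) -> dict:
    """Tuning keys to write for a custom dataset when the user gave none."""
    params = {
        "nprobe": "100",
        "num_queries": str(min(1000, queries_available)),
        "phases": "INGEST,COLD,WARM",
    }
    # Below 1M vectors the bench's 1500/500 fallback is too coarse,
    # so the smaller thresholds are written out explicitly.
    if num_vectors < 1_000_000:
        params["split_threshold"] = "150"
        params["merge_threshold"] = "50"
    return params
```

At 1M vectors and above the threshold keys are left out on purpose, so the bench's own fallback applies.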
Storage and disk-space planning
Always prompt the user about the storage backend before writing bench.toml. This is non-negotiable and not conditional on dataset size. The wording of the prompt changes based on the estimated footprint (see below) but the prompt itself always happens. If you find yourself writing bench.toml without having asked, you've skipped a required step.
The bench's on-disk footprint is bigger than the raw embedding bytes, and large datasets routinely fill local disk if you leave the default in place. Equally important: a local-storage run doesn't reflect S3 latency or cost, which is how production SlateDB usually runs — so even when local fits, the user often wants S3 instead.
Estimate the footprint
Rough rule: SlateDB on local disk needs about 2–3× the raw embedding
bytes (num_vectors × dim × 4), plus the converted Parquets and ground
truth for Mode C. A few orders of magnitude for perspective:
- sift100k (100K × 128 × 4 ≈ 50 MB): ~150 MB total — local is fine.
- cohere1m (1M × 768 × 4 ≈ 3 GB): ~6–9 GB total — local OK on most disks.
- cohere_wiki_10m (10M × 1024 × 4 ≈ 41 GB): ~80–120 GB — configure S3.
- sift1b (1B × 128 × 4 ≈ 512 GB): ~1 TB+ — must configure S3.
For Mode C add the size of the converted Parquets (roughly the input file
size) and the generated ground-truth file (negligible — num_queries × 40
bytes for top-10 int32 IDs).
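The rule of thumb can be written out as arithmetic. The sketch below assumes float32 embeddings (4 bytes per component) and int32 ground-truth IDs, as in the figures above; the function name and parameters are illustrative:

```python
def estimate_footprint_bytes(
    num_vectors: int,
    dim: int,
    num_queries: int = 0,
    converted_input_bytes: int = 0,
    gt_depth: int = 10,
) -> tuple:
    """Return a (low, high) estimate of the bench's local footprint in bytes."""
    raw = num_vectors * dim * 4  # float32 embeddings
    # Mode C extras: converted Parquets plus ground truth (int32 neighbour IDs).
    extras = converted_input_bytes + num_queries * gt_depth * 4
    return 2 * raw + extras, 3 * raw + extras


low, high = estimate_footprint_bytes(1_000_000, 768)  # cohere1m-sized input
print(f"{low / 1e9:.1f}-{high / 1e9:.1f} GB")  # prints "6.1-9.2 GB"
```

Use the high end of the range when comparing against free disk space, since the estimate is deliberately rough.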
Prompt the user about the storage backend
Always prompt the user before writing bench.toml. The default in the
config schema is [data.storage.object_store] type = "Local", which writes
to local disk and won't reflect S3 latency or cost — the bench is most
useful when configured the way you'd run SlateDB in production, which is
usually backed by object storage.
For estimated footprint ≤ 10 GB, a short prompt is enough:
Estimated bench footprint is ~N GB. I'll put SlateDB on local disk at
/tmp/<run-name>/. Is that okay, or do you want a different local path or S3?
For footprint > 10 GB, prompt with the S3 recommendation explicit:
Estimated bench footprint is ~N GB, more than you usually want on local disk. The default storage backend is
Local, which writes to local disk and won't reflect S3 latency or cost. I'd recommend configuring S3 — tell me a bucket and region and I'll wire it up. Otherwise tell me which local path to use.
If the user picks S3, write:
[data.storage.object_store]
type = "Aws"
region = "<user>"
bucket = "<user>"
…and remind them their AWS credentials need to be discoverable by the AWS
SDK (env vars, ~/.aws/credentials, or EC2 instance metadata). The bench
will fail fast if credentials are missing.
Check free disk space before any large write
Whenever storage involves local disk — i.e. the storage backend is
Local, or for any Mode C path that writes converted Parquets / ground
truth into bench-runs/<run-name>/data/ — check free space at the target
path before running anything large:
df -Pk "<path>" | awk 'NR==2 {print $4 * 1024}' # bytes free (-k forces 1024-byte blocks; plain -P reports 512-byte blocks on some systems)
If free < estimated_footprint × 1.2 (20% headroom), stop and ask for
an alternative; don't pick one yourself:
Need ~N GB at <path> but only M GB is free. Where should the data go?
- a different local path with more room (give me the path);
- switch storage to S3 (give me a bucket + region);
- shrink the workload: for a custom dataset, set max_vectors = "N" to ingest only a prefix; for a built-in, pick the next smaller variant (e.g. sift10m instead of sift100m).
Update bench.toml with the user's pick and re-check before proceeding.
Never silently write to a path the user didn't choose, even if you can
find one with enough room — surprising the user with 20 GB on the wrong
partition is worse than failing.
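The same headroom check can be done from Python instead of df. This is a sketch under one assumption: the target directory may not exist yet, so the check walks up to the nearest existing parent and measures that filesystem. Both helper names are illustrative.

```python
import shutil
from pathlib import Path


def nearest_existing(path: str) -> Path:
    """Walk up from path to the closest ancestor that already exists."""
    p = Path(path).resolve()
    while not p.exists():
        p = p.parent  # terminates at the filesystem root, which always exists
    return p


def has_headroom(path: str, estimated_footprint_bytes: int,
                 headroom: float = 1.2) -> bool:
    """True if free space at path covers the estimate plus 20% headroom."""
    free = shutil.disk_usage(nearest_existing(path)).free
    return free >= estimated_footprint_bytes * headroom
```

When `has_headroom` returns `False`, the right move is still to stop and ask; the function only reports, it never picks an alternative path.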
S3-backed runs still need local-disk checks
For S3-backed runs, free-space checks don't apply to the SlateDB data, but they do still apply to:
- the converted Parquets in bench-runs/<run-name>/data/ (Mode C only; read from local disk, not S3);
- the generated ground_truth_file (likewise local in Mode C unless pointed elsewhere);
- the foyer block cache disk tier (only if block_cache_disk_bytes is set in the config; it uses block_cache_disk_path).
Generating ground truth
Always run from the repo root:
cargo run -p vector-bench --release --bin gen_groundtruth -- \
--config bench-runs/<run-name>/bench.toml
The tool reads the same TOML the bench reads. It writes the dataset's
ground_truth_file in the matching format (ivecs for fvecs/bvecs, Parquet
for parquet). Ground truth depth is 10 by default, which is what recall@10
needs.
If filter_spec is set, the tool generates filtered ground truth
automatically — exact top-k among the rows each query's filter matches.
Before running, verify that every column the filter references is listed in
metadata_columns (otherwise the filter matches nothing and recall will look
broken).
If ground_truth_file already exists, ask the user before regenerating;
don't silently overwrite.
Run
cargo run -p vector-bench --release -- \
--config bench-runs/<run-name>/bench.toml
Stream the output. Watch for these signals:
- phase: INGEST → ingest is running.
- Ingested N vectors in T s → ingest completed.
- recall@10 = ... → warm phase finished.
- cold reader p50 = ... → cold phase finished.
- Anything panicked at → halt, capture the error in summary.md.
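One way to watch for these signals while streaming output is a small line classifier. The sketch below takes the log fragments exactly as listed above and assumes nothing else about the bench's output format; the state labels and `classify_line` are made up for illustration.

```python
import re
from typing import Optional

# Checked in order; a panic wins over everything else on the same line.
SIGNALS = [
    (re.compile(r"panicked at"), "failed"),
    (re.compile(r"phase: INGEST"), "ingest-running"),
    (re.compile(r"Ingested \d+ vectors in"), "ingest-done"),
    (re.compile(r"recall@10 = "), "warm-done"),
    (re.compile(r"cold reader p50 = "), "cold-done"),
]


def classify_line(line: str) -> Optional[str]:
    """Map one line of bench output to a progress label, or None."""
    for pattern, label in SIGNALS:
        if pattern.search(line):
            return label
    return None
```

On a `failed` label, keep the line and the stderr that follows, since summary.md needs the captured error.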
When the run completes (or fails), write summary.md.
Writing summary.md
Always write it, even on failure. Sections to include, in order:
- Dataset: 1–2 sentences describing what data was benchmarked, the row count, dim, and metric.
- Configuration: link to bench.toml, and a small table of the key knobs (dataset, dimensions, metric, format, nprobe, split_threshold, num_queries, phases, filter_spec if any).
- Results: a headline metrics table:
  - recall@10
  - warm QPS, warm p50/p90/p99
  - cold p50/p90/p99
  - ingest throughput, num vectors, num centroids
- Commentary: 1 short paragraph per metric explaining what this number means in this setup. Don't editorialise; say things like "warm phase is rate-limited by query_qps_limit = 32, so this is per-query latency at that load, not max throughput."
- What this did NOT measure: the standard caveats list (no metadata filters except when filter_spec was used, no deletes/updates, no mixed read/write, no multi-tenant).
- How to reproduce: the exact cargo run invocation pointing at this bench.toml. If Mode C, also point at convert.py.
If the run failed, replace the Results section with the error captured from stderr and a one-line guess at the cause if you can identify one (common: GT file missing, vector_column wrong, dimensions mismatch).
Inspecting parquet
When you need to look at an unfamiliar parquet, prefer a short pyarrow probe over reading the whole file:
import pyarrow.parquet as pq

pf = pq.ParquetFile("<path.parquet>")
print("rows:", pf.metadata.num_rows)
for f in pf.schema_arrow:
    print(f"  {f.name}: {f.type}")
# Peek at one row to confirm shape; iter_batches avoids loading the whole column.
batch = next(pf.iter_batches(batch_size=1, columns=["<vec_col>"]))
print(batch.to_pylist())
Run this via Bash with python3 -c "..." so output ends up in the
transcript. Identify:
- Embedding column: a FixedSizeList<float32>, list<float32>, or list<float64>. If there's only one, use it. If multiple, ask.
- ID column: an int32/int64 named like id / doc_id. It is used only for GT remap and is usually unnecessary if you're regenerating GT.
- Metadata columns: strings/ints/bools/floats other than the above. These are candidates to ingest if the user wants filtered queries.
When to ask the user
Always ask (every run, no exceptions):
- Storage backend — local vs. S3, and the path/bucket details. See
Storage and disk-space planning.
Yes, even for tiny datasets. The schema's Local default is a schema default, not a skill default; the skill always confirms.
Ask only when ambiguity is genuine:
- Multiple float-list columns → which is the embedding?
- No metric implied → l2 (default), cosine, or dot_product?
- Single-file split heuristic → "I'll use the last 1000 rows as queries. Okay?"
- Pre-existing ground_truth_file → "Overwrite it?"
Don't ask:
- Dataset name (default to a slug of the input filename or what the user called it).
- Run-name / output path (always bench-runs/<name>/).
- Block cache config (defaults are fine).
- Phases (default INGEST,COLD,WARM).
- nprobe / split_threshold (use defaults; user can edit the TOML and re-run if they want to tune).
Defensive checks
Before running, verify:
- cargo is on PATH (Bash: cargo --version).
- The chosen base_file / query_file exist and are readable.
- For parquet: the columns named in vector_column / ground_truth_column / id_column / metadata_columns exist in the schema.
- For filtered queries: every @col referenced in the filter spec is in metadata_columns.
- For Mode A sift100k: vector/tests/data/sift100k/base.fvecs exists (extract the tarball if not).
- The output dir bench-runs/<run-name>/ exists (mkdir -p).
- The user has explicitly chosen a storage backend. Don't proceed with the schema default (Local) silently; see Storage and disk-space planning.
- Free disk space ≥ estimated footprint × 1.2 at every local path the run will write to (storage path if Local; bench-runs/<run-name>/data/ for Mode C; block_cache_disk_path if hybrid cache is configured).
If any check fails, stop and write summary.md describing the problem
instead of running a bench that will silently produce garbage.
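A few of these checks are mechanical enough to script. The sketch below covers only the tool, file, and storage-confirmation checks; the parquet-schema and filter checks depend on formats this document doesn't spell out. `preflight` is a hypothetical helper, and its `which` parameter exists only so the lookup can be stubbed.

```python
import os
import shutil
from typing import Callable, List, Optional


def preflight(
    base_file: str,
    query_file: str,
    storage_confirmed: bool,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return a list of problems; an empty list means it is safe to run."""
    problems = []
    if which("cargo") is None:
        problems.append("cargo is not on PATH")
    for label, path in (("base_file", base_file), ("query_file", query_file)):
        if not (os.path.isfile(path) and os.access(path, os.R_OK)):
            problems.append(f"{label} is missing or unreadable: {path}")
    if not storage_confirmed:
        problems.append("storage backend was not explicitly chosen by the user")
    return problems
```

A non-empty result maps directly onto the failure path above: write the problems into summary.md and stop.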