name: benchmark-memory description: Systematic benchmarking framework for Local Brain Search memory system with LLM-as-judge scoring automation: gated allowed-tools: - Bash - Read - Write - Glob - Grep - Task user-invocable: true
Benchmark Memory System
Systematic benchmarking framework to measure retrieval quality, compare configurations, and identify optimal parameters for the Local Brain Search memory system.
Purpose
- Measure retrieval quality objectively using LLM-as-judge scoring
- Compare different configuration settings (spreading vs static, parameter sweeps)
- Identify optimal parameters for different query types (factual, conceptual, synthesis)
- Generate reproducible results against frozen test datasets
Design Principles
- Contained: Skill + sub-agent + bundled scripts
- Reproducible: Test against frozen Brain snapshot
- Automated: LLM-as-judge for relevance scoring
- Analyzable: CSV output for analysis
State Dependencies
| Source | Location | Read | Write |
|---|---|---|---|
| Brain snapshot | .claude/skills/benchmark-memory/snapshots/ |
Yes | Yes |
| Query sets | .claude/skills/benchmark-memory/query-sets/ |
Yes | Yes |
| Benchmark results | .claude/skills/benchmark-memory/results/ |
Yes | Yes |
| Analysis reports | .claude/skills/benchmark-memory/analysis/ |
No | Yes |
| Memory system | resources/local-brain-search/ |
Yes | No |
Prerequisites
- Local Brain Search system indexed (
resources/local-brain-search/data/brain.faiss) - Python venv at
resources/local-brain-search/venv/with search dependencies - Claude Code CLI installed and authenticated (for LLM-as-judge scoring via headless mode)
LLM-as-Judge Scoring
This skill uses Claude Code headless mode (claude -p) for LLM relevance scoring, not a separate API key. This means:
- No
ANTHROPIC_API_KEYenvironment variable needed - Uses your existing Claude Code authentication
- Default model:
sonnet(good quality) - can also usehaiku(faster/cheaper) oropus - JSON output via prompt engineering for reliable scoring
To verify Claude Code is available:
claude --version
Installing Dependencies
Dependencies are installed in the local-brain-search venv:
cd resources/local-brain-search
source venv/bin/activate
pip install pandas tqdm # anthropic not required - uses Claude Code headless
Sub-Commands
/benchmark-memory setup
Create a frozen Brain snapshot and build its index.
cd .claude/skills/benchmark-memory/scripts
./run_benchmark.sh --list-snapshots # Check existing
python3 create_snapshot.py # Create new snapshot
What it does:
- Creates snapshot directory with date stamp
- Copies Brain folder (excluding .obsidian, .trash)
- Builds FAISS index for the snapshot
- Creates SNAPSHOT-INFO.md with metadata
/benchmark-memory create-queries [--count N]
Generate or manage test query sets.
cd .claude/skills/benchmark-memory/scripts
python build_query_set.py --count 50 --output ../query-sets/core-50.json
Query Categories (50 total):
| Category | Count | Example |
|---|---|---|
| Factual | 10 | "What is dopamine?" |
| Conceptual | 10 | "How does motivation work?" |
| Synthesis | 15 | "Connect Buddhism and neuroscience" |
| Temporal | 5 | "Recent notes about AI agents" |
| Needle | 5 | "Note about intermittent reinforcement" |
| Broad | 5 | "Identity" |
/benchmark-memory run [--config CONFIG] [--snapshot SNAPSHOT]
Execute benchmark with specified configuration.
cd .claude/skills/benchmark-memory/scripts
./run_benchmark.sh --config focused --snapshot brain-snapshot-2026-02-18
./run_benchmark.sh --dry-run --config focused # Preview without execution
./run_benchmark.sh --list-configs # List available configs
Configurations:
focused: 15 key configurations (recommended for initial benchmarking)single:CONFIG_NAME: Run single configurationall: Full parameter sweep (expensive)
Estimated cost per run:
- 50 queries x 10 results x 15 configs = 7,500 LLM scores
- Using Sonnet (default): ~$75
- Using Haiku: ~$7.50
/benchmark-memory analyze [--results FILE]
Generate analysis summary from benchmark results.
cd .claude/skills/benchmark-memory/scripts
python analyze_results.py --results ../results/benchmark-*.csv
Outputs:
- Summary by configuration
- Summary by query category
- Best config per intent
- Recommendations
Workflow
Step 1: Setup (one-time)
/benchmark-memory setup
|
v
Step 2: Create Queries (one-time)
/benchmark-memory create-queries --count 50
|
v
Step 3: Run Benchmark (per experiment)
/benchmark-memory run --config focused
|
v
Step 4: Analyze Results
/benchmark-memory analyze
Metrics Collected
Performance Metrics
| Metric | Description |
|---|---|
latency_ms |
Query execution time |
iterations |
Spreading iterations used |
converged |
Whether spreading converged |
Quality Metrics
| Metric | Range | Description |
|---|---|---|
| Precision@K | 0-1 | Fraction of results that are relevant |
| Recall@K | 0-1 | Fraction of relevant notes found |
| MRR | 0-1 | Mean Reciprocal Rank |
| NDCG@K | 0-1 | Ranking quality with position discount |
| Avg Score | 0-3 | Average LLM relevance score |
LLM-as-Judge Scoring Scale
| Score | Label | Definition |
|---|---|---|
| 0 | Irrelevant | No connection to query |
| 1 | Tangential | Loosely related |
| 2 | Relevant | Addresses the query |
| 3 | Highly Relevant | Directly answers the query |
Configurations to Test
Baseline
static_baseline: Traditional vector searchspreading_default: Spreading activation with defaults
Parameter Sweeps
- Iteration count: 2, 5, 7
- Inhibition strength: 0.1, 0.3, 0.5
- Temporal decay: 0.8, 0.9, 0.95
- Q-weight: 0.0, 0.3, 0.5
Optimized Combinations
synthesis_optimized: max_iterations=7, inhibition=0.1factual_optimized: max_iterations=2, inhibition=0.5balanced_optimized: max_iterations=5, inhibition=0.2, decay=0.85
Output Files
Results CSV
results/benchmark-YYYY-MM-DD-HHMMSS.csv
Schema:
timestamp,config_name,query_id,query_category,mode,max_iterations,
inhibition_strength,latency_ms,result_1_note,result_1_score,...,
precision_at_5,precision_at_10,mrr,ndcg_at_10,avg_score
Analysis Report
analysis/report-YYYY-MM-DD.md
Error Recovery
Partial benchmark run
Results are appended incrementally. Resume by running with --resume:
python run_benchmark.py --config focused --resume
API rate limits
Built-in retry with exponential backoff. Adjust --delay if needed:
python run_benchmark.py --config focused --delay 1.0
Invalid snapshot
Re-create snapshot:
python create_snapshot.py --force
Success Criteria
- Snapshot creation produces valid index
- Query set covers all 6 categories (50+ queries)
- LLM judge produces consistent scores (>80% agreement on re-run)
- All 15 focused configs can be benchmarked
- Results CSV is valid and analyzable
- Analysis identifies best config per query type
Expected Insights
After running benchmarks, answer these questions:
- Does spreading beat static? For which query types?
- What's the optimal iteration count for synthesis queries?
- Does high inhibition help factual queries?
- Does q_weight > 0 improve results over time?
- What settings work best for each query type?
Cost Management
| Strategy | Savings | Trade-off |
|---|---|---|
| Score top 5 only | 50% | Less data on long-tail |
| Use Haiku judge | 90% | Slightly less accurate |
| Cache scores | Variable | Only for unchanged retrieval |
Recommendation: Start with Haiku judge, validate sample against Sonnet.
Directory Structure
.claude/skills/benchmark-memory/
├── SKILL.md # This file
├── requirements.txt # Python dependencies
├── scripts/
│ ├── run_benchmark.sh # Wrapper script (uses venv Python)
│ ├── create_snapshot.py # Create frozen Brain snapshot
│ ├── build_query_set.py # Generate/manage query test set
│ ├── run_benchmark.py # Execute benchmark with config
│ ├── score_results.py # LLM-as-judge scoring
│ ├── compute_metrics.py # Calculate evaluation metrics
│ └── analyze_results.py # Generate analysis summary
├── configs/
│ ├── focused_configs.json # Test configurations (15 configs)
│ └── judge_prompt.txt # LLM judge prompt template
├── snapshots/ # Frozen Brain copies
├── query-sets/ # Test queries (core-50.json included)
├── results/ # Benchmark CSVs
└── analysis/ # Analysis reports
Tested Components
| Component | Status | Notes |
|---|---|---|
| Snapshot creation | ✅ Works | Creates snapshot with FAISS index + graph |
| Query set | ✅ Works | 50 queries across 6 categories |
| Static search | ✅ Works | Traditional vector similarity |
| Spreading search | ✅ Works | Multi-iteration activation |
| 15 configs | ✅ Works | Focused parameter sweep |
| LLM-as-judge | ✅ Works | Uses Claude Code headless mode (claude -p) |
| Results CSV | Ready | Incremental writes, resume support |