name: search-benchmark description: Generate a search quality benchmark for the AI Registry. Generates ground truth from the registry's assets, runs 100+ queries against the semantic search API, evaluates results using NDCG@10/MRR/Recall, and produces a markdown report. Use when you want to measure search quality after changes to the scoring algorithm, embedding model, or indexed content. license: Apache-2.0 metadata: author: mcp-gateway-registry version: "1.0"
Search Benchmark Skill
Measure semantic search quality against a deployed AI Registry. Generates a ground truth dataset from the registry's own assets, runs queries against the live API, evaluates results using standard information retrieval metrics (NDCG@10, MRR, Recall@10), and produces a markdown report.
Prerequisites
- Registry URL - The base URL of the deployed registry (e.g.,
https://d2xl2zfuhgc4l0.cloudfront.net) - JWT Token - A valid admin token in
.tokenfile (get from "Get JWT Token" button in registry UI) - Registry must have assets indexed - At least some servers, agents, or skills registered
The .token file supports both raw JWT format and the full JSON response from the registry UI.
Input
/search-benchmark [REGISTRY_URL] [TOKEN_FILE]
- REGISTRY_URL - Base URL of the registry to benchmark (default: reads from user or uses
http://localhost) - TOKEN_FILE - Path to the token file (default:
.token)
Workflow
Step 1: Check for Existing Ground Truth
Check if a ground truth dataset already exists:
ls tests/fixtures/search_dataset/ground_truth.json 2>/dev/null
If the file exists, report how many queries it contains and ask the user: "A ground truth dataset already exists (N queries). Do you want to use it or generate a new one from this registry?"
- If use existing: skip to Step 2
- If generate new: proceed to Step 1b
Step 1b: Generate Expert Ground Truth
This is NOT a simple programmatic generation. You must deeply analyze the registry's assets and craft queries like a search expert. Follow this process:
1b.1: Fetch all assets as JSON
TOKEN=$(python3 -c "
import json
with open('{TOKEN_FILE}') as f:
raw = f.read().strip()
if raw.startswith('{'):
data = json.loads(raw)
print(data.get('tokens',{}).get('access_token') or data.get('access_token',''))
else:
print(raw.replace('Bearer ',''))
")
curl -s -H "Authorization: Bearer $TOKEN" "{REGISTRY_URL}/api/servers?limit=2000" > /tmp/servers.json
curl -s -H "Authorization: Bearer $TOKEN" "{REGISTRY_URL}/api/agents?limit=2000" > /tmp/agents.json
curl -s -H "Authorization: Bearer $TOKEN" "{REGISTRY_URL}/api/skills?limit=2000" > /tmp/skills.json
1b.2: Analyze the assets deeply
Read the JSON dumps. Filter out stress-test and security-pending assets. For each real asset, understand:
- What it does (from description)
- What tools it has (tool names and descriptions)
- What tags it uses
- How users would naturally search for it
1b.3: Craft 100 queries across these categories
| Category | Count | Strategy |
|---|---|---|
| exact-name | 10 | Product/server/agent names as queries |
| semantic | 15 | Natural language paraphrases with NO keyword overlap |
| tool-precision | 10 | Exact tool names as queries |
| tag-based | 10 | Tag vocabulary combinations |
| multi-entity | 10 | Queries where correct answers span servers + agents + skills |
| conflict | 10 | Ambiguous terms where vector and keyword disagree |
| no-answer | 10 | Queries completely outside the dataset (no match exists) |
| agent-focused | 10 | Queries specifically targeting agents |
| skill-focused | 10 | Queries specifically targeting skills |
| tricky | 15 | Edge cases, short queries, non-English, adversarial |
1b.4: For each query, define expected results
{
"query": "the search query",
"category": "one of the categories above",
"description": "what this query tests and why",
"expected": [
{"path": "/exact-path-from-registry", "grade": 3, "reason": "why this should match"},
{"path": "/another-path", "grade": 2, "reason": "why this is relevant but not perfect"}
]
}
Grade scale: 3 = perfect match, 2 = highly relevant, 1 = somewhat relevant. For no-answer queries, expected should be an empty list.
1b.5: Validate all paths exist
Every path in expected results must exist in the fetched assets. Verify before saving.
1b.6: Save
Write the ground truth to tests/fixtures/search_dataset/ground_truth.json.
Tell the user how many queries were created per category and that they should review it.
Step 2: Run Benchmark
Run all queries against the live semantic search API and generate a report:
uv run python scripts/benchmark_search.py \
--url {REGISTRY_URL} \
--token-file {TOKEN_FILE} \
--queries tests/fixtures/search_dataset/generated_ground_truth.json
Output:
tests/fixtures/search_dataset/benchmark_results.json(raw results)tests/fixtures/search_dataset/benchmark_results.md(markdown report)
Step 3: Review Report
Show the user the key metrics from the report:
- Quality Metrics - NDCG@10 > 0.7 is good, > 0.8 is excellent
- Score Health - Saturated scores at 1.0 should be < 15% (if higher, scoring formula may have issues)
- Quality by Category - Identify weak areas (e.g., agent-focused queries underperforming)
- Per-query results - Check specific queries that scored 0.0 (complete miss) or low NDCG
Open the report in the editor for the user to review.
Step 4: Compare Before/After (Optional)
If asked to compare two runs (e.g., before and after a scoring algorithm change):
uv run python scripts/benchmark_search.py \
--compare tests/fixtures/search_dataset/benchmark_results.json other_results.json
Output Format
The report includes:
- Registry metadata: version, database backend, server/agent/skill counts
- Quality Metrics: NDCG@10, MRR, Recall@10 averaged across all queries
- Score Health: saturation analysis (unique scores, % at 1.0, range)
- Quality by Category: breakdown per query type
- Per-query results: top 5 results with scores and ground truth comparison
Interpreting Metrics
| Metric | Good | Excellent | Poor |
|---|---|---|---|
| NDCG@10 | > 0.65 | > 0.80 | < 0.50 |
| MRR | > 0.70 | > 0.85 | < 0.50 |
| Recall@10 | > 0.75 | > 0.90 | < 0.60 |
| Score saturation | < 15% | < 5% | > 30% |
Example
$ /search-benchmark https://d2xl2zfuhgc4l0.cloudfront.net .token
This will:
- Generate ~90 queries from the registry's assets
- Run each query against the semantic search API
- Produce a report showing search quality metrics
- Open the report for review
Troubleshooting
- 401 errors: Token expired. Get a fresh one from the registry UI "Get JWT Token" button.
- 0 assets found: Registry has no servers/agents/skills registered, or token lacks permissions.
- Low NDCG on "description-derived" category: Expected for programmatically generated queries. Add hand-curated queries for better evaluation.
- High saturation (>30% at 1.0): Scoring formula may be using the legacy additive method. Set
SEARCH_FUSION_METHOD=rrfon the registry.
Related
- Search Evaluation Harness README - Full documentation on the evaluation system
- Search Quality Benchmark (reference) - Sample results from development registry
- Hybrid Search Architecture - How scoring works