keyless-evaluator - SKILL.md Agent Skill

name: keyless-evaluator
description: >
  LLM-as-judge search result evaluator. Scores search results 0-3 (Irrelevant → Highly Relevant) using Gemini, OpenAI, Anthropic, ChatGPT Web, Codex CLI, or Claude CLI. Runs with `uv run`. Supports standard {id, title, snippet} input AND dynamic raw JSON from any search API via /v1/evaluate with auto field detection. Use this skill when: evaluating search quality, running LLM-as-judge scoring, ranking results, computing nDCG, or integrating the REST API. For speed-sensitive use: prefer claude_cli (3s) or codex (5–10s) over browser providers.

Keyless Evaluator — Agent Skill

Quick Reference

# Sync deps (Python 3.13, uv)
uv sync

# CLI
uv run keyless-eval --help
uv run keyless-eval eval -q "query" -f results.json               # Gemini (default, free)
uv run keyless-eval eval -q "query" -f results.json -p chatgpt_web # no key needed
uv run keyless-eval eval -q "query" -f results.json -p anthropic
uv run keyless-eval providers

# HTTP server
uv run main.py              # → http://127.0.0.1:8510  docs: /docs
uv run main.py --host 0.0.0.0 --port 8080   # custom bind
uv run main.py --reload                       # dev mode

# Tests
uv run pytest tests/ -v

macOS sandbox: if uv cannot create or use .venv, set UV_PROJECT_ENVIRONMENT=/tmp/keyless-eval-venv so the environment lives outside the project directory.

Architecture

keyless_evaluator/
├── cli.py          # Typer CLI (eval, detail, example, providers, serve)
├── models.py       # Pydantic: RelevanceScore, SearchResult, EvaluationRequest/Response,
│                   #           FieldMapping, RawEvaluationRequest
├── prompts.py      # SYSTEM_PROMPT + build_user_prompt
├── parser.py       # Parse raw LLM JSON → list[ResultScore], robust fence stripping
├── evaluators.py   # Backends: GeminiEvaluator, OpenAIEvaluator, AnthropicEvaluator,
│                   #           ChatGPTWebEvaluator, GeminiWebEvaluator,
│                   #           CLIEvaluator (codex + claude_cli) + factory get_evaluator()
├── adapter.py      # Dynamic raw JSON adapter: dot-path resolver, auto field detection
├── renderer.py     # Rich terminal: tables, detail panels, nDCG stats
└── server.py       # FastAPI: POST /v1/evaluate, POST /v1/evaluate/raw, GET /health

Providers

| Provider | Key/CLI Needed | Default Model | Speed | Notes |
|---|---|---|---|---|
| `gemini` (default) | `GEMINI_API_KEY` | `gemini-2.0-flash` | ⚡ ~0.5–2s | Free 1500 req/day |
| `claude_cli` | `claude` CLI | `claude-opus-4-6` | ⚡ ~3s | Claude Code pipe mode; fastest no-key option |
| `codex` | `codex` CLI + ChatGPT Plus | `gpt-5.4` | 🟡 ~5–10s | OpenAI Codex CLI; use `gpt-5.4-mini` for ~3.7s |
| `openai` | `OPENAI_API_KEY` | `gpt-4o` | ⚡ ~2s | OpenAI API |
| `anthropic` | `ANTHROPIC_API_KEY` | `claude-3-5-haiku-20241022` | ⚡ ~2s | Anthropic Claude |
| `chatgpt_web` | None | `auto` | 🐢 4–6 min | Browser automation; last resort only |
| `gemini_web` | Google login | `auto` | 🐢 1–2 min | Browser automation; last resort only |

Speed order

gemini API (~0.5s) > claude_cli (~3s) > codex (~5–10s) > gemini_web (~1–2 min) > chatgpt_web (4–6 min)

Always prefer API or CLI providers over browser automation (chatgpt_web, gemini_web).

CLI provider setup

claude_cli — pipe mode via Claude Code:

npm install -g @anthropic-ai/claude-code
# authenticated automatically if you use Claude Code

codex — OpenAI Codex CLI with ChatGPT Plus:

npm install -g @openai/codex
codex login   # authenticate with ChatGPT Plus account

Use gpt-5.4-mini for faster responses (~3.7s), or omit model for the default gpt-5.4 (~5–10s). The provider always uses model_reasoning_effort=low, which is sufficient for 0–3 scoring tasks.

# Fast codex eval
curl -X POST "http://localhost:8510/v1/evaluate?provider=codex&model=gpt-5.4-mini" \
  -H "Content-Type: application/json" \
  -d '{"input": "senior python backend hanoi", "output": {...}}'

# claude_cli eval
curl -X POST "http://localhost:8510/v1/evaluate?provider=claude_cli" \
  -H "Content-Type: application/json" \
  -d '{"input": "senior python backend hanoi", "output": {...}}'

API Endpoints

POST /v1/evaluate — standard structured input

{
  "query": "remote jobs",
  "results": [
    {"id": "r1", "title": "...", "snippet": "...", "url": "...", "metadata": {}}
  ]
}
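
For callers that prefer Python over curl, the structured request above can be built with the standard library. This is a minimal sketch: `build_evaluate_request` is a hypothetical helper, the base URL assumes the default local server, and the `provider` and `model` query parameters follow the curl examples earlier in this document.

```python
import json
import urllib.parse
import urllib.request


def build_evaluate_request(query, results, provider=None, model=None,
                           base_url="http://127.0.0.1:8510"):
    """Build (but do not send) a POST /v1/evaluate request."""
    params = {k: v for k, v in (("provider", provider), ("model", model)) if v}
    url = f"{base_url}/v1/evaluate"
    if params:
        url += "?" + urllib.parse.urlencode(params)
    body = json.dumps({"query": query, "results": results}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With the server running, `urllib.request.urlopen(req)` sends the request. The response schema is not shown in this document, so the sketch stops at building the request.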

POST /v1/evaluate/raw — paste any search API response directly

{
  "query": "remote jobs",
  "max_results": 10,
  "raw": { ...any search API response body... },
  "mapping": {
    "data_path": "data",
    "id_field": "id",
    "title_field": "jobTitle",
    "snippet_field": "jobDescription",
    "metadata_fields": ["company", "salary", "location", "employmentTypeEn"]
  }
}

All mapping fields are optional; any that are omitted are auto-detected from common names.
Auto-detected array keys: data, results, hits, items, docs, records, jobs.
Auto-detected title candidates: title, jobTitle, name, headline, subject, label.
Auto-detected snippet candidates: snippet, jobDescription, description, summary, body.

adapter.py — Key Functions

adapt_raw_input(raw, mapping, max_results) → list[SearchResult]
_resolve_path(obj, "dot.notation.path") → nested value
_scalar_value(val) → flattens lists (["Hà Nội", "HCM"] → "Hà Nội, HCM")
_auto_metadata(item, exclude_keys) → picks best scalar fields, max 12
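
To make the two small helpers concrete, here is a sketch of what a dot-path resolver and a scalar flattener could look like. The names `resolve_path` and `scalar_value` are stand-ins written without the leading underscore to mark them as illustrations; support for numeric list indices and returning None for missing keys are assumptions.

```python
from typing import Any


def resolve_path(obj: Any, path: str) -> Any:
    """Follow a dot-notation path through nested dicts and lists; None if missing."""
    current = obj
    for part in path.split("."):
        if isinstance(current, dict):
            current = current.get(part)
        elif isinstance(current, list) and part.isdigit() and int(part) < len(current):
            current = current[int(part)]
        else:
            return None
    return current


def scalar_value(val: Any) -> str:
    """Flatten lists into a comma-separated string; stringify everything else."""
    if isinstance(val, (list, tuple)):
        return ", ".join(str(v) for v in val)
    return "" if val is None else str(val)
```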

Adding a New Provider

Create class in evaluators.py extending BaseEvaluator
Implement async def evaluate(self, request: EvaluationRequest) -> EvaluationResponse
Use self._build_response(request, scores) to wrap results
Add to PROVIDER_MAP and _DEFAULT_MODELS

Ubuntu Server Setup (with Desktop + Chrome)

# 1. Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh && source ~/.bashrc

# 2. Get the code (first time)
git clone https://github.com/arecavn/keyless-evaluator.git
cd keyless-evaluator

# OR if downloaded as zip:
cd ~/Downloads/keyless-evaluator-main
git init && git remote add origin https://github.com/arecavn/keyless-evaluator.git

# 3. Install deps + Playwright browsers
uv sync
uv run playwright install --with-deps chromium

# 4. Install real Chrome
wget -q -O /tmp/chrome.deb https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo apt install -y /tmp/chrome.deb

# 5. Copy .env
cp .env.example .env   # then edit with your keys

Pull updates (after first setup)

git fetch origin && git reset --hard origin/main && uv sync && sh support/restart.sh

CDP setup (required for chatgpt_web)

Headless Ubuntu server (no display). Always use single-line commands; multi-line commands with \ line continuations break in Tabby:

google-chrome --remote-debugging-port=9222 --user-data-dir=/tmp/chatgpt-cdp-profile --no-first-run --no-default-browser-check --no-sandbox --disable-gpu --headless=new &

Verify Chrome started:

curl http://127.0.0.1:9222/json/version

Ubuntu with desktop (to log in manually):

google-chrome --user-data-dir=/tmp/chatgpt-cdp-profile --remote-debugging-port=9222 --no-first-run --no-default-browser-check

Log in to ChatGPT in that window
Navigate to your GPT project URL and solve any WAF challenge once
Keep Chrome open

Add to .env:

CHATGPT_CDP_URL=http://127.0.0.1:9222
CHATGPT_WEB_HEADLESS=0

Verify Chrome is listening:

curl http://127.0.0.1:9222/json/version

Start server:

sh support/restart.sh

IMPORTANT: Always use sh support/restart.sh to restart — NEVER use pkill -f main.py as it kills all Python main.py processes across all projects on the machine.

Notes:

Default host is 0.0.0.0:8510 — accessible from network
Chrome must stay open while server runs
xclip not needed — server uses execCommand fallback when $DISPLAY is unavailable
To update: always git fetch origin && git reset --hard origin/main (not just git pull)

Docker Deployment

# Build
docker build -t keyless-evaluator:latest .

# Run (port 8511 host → 8510 container)
docker compose up -d

# Health check
curl http://localhost:8511/health

Dockerfile layer cache rules (important when editing):

Layers 1-2: OS packages + uv — almost never invalidated
Layer 3: pyproject.toml / uv.lock — invalidated when adding/upgrading deps
Layer 4: uv sync --no-install-workspace — external wheels, cached by uv.lock
Layer 5: playwright install — ~100 MB, only reruns when playwright version changes
Layer 6: source code (api/, main.py) — copy here last so code edits reuse all above layers
Layer 7: uv sync (local package) — fast (<1 s), runs after source copy

Never move source COPY above playwright install — it breaks cache for expensive layers.

chatgpt_web on Mac — CDP Connect Mode (recommended)

WAF blocks any fresh Chrome profile launch (even real Chrome). The reliable fix is CDP connect mode: open Chrome once with a debug port, log in manually, keep it running. All evaluations reuse that live session — no WAF challenges.

# 1. Open Chrome with debug port (run once, keep window open)
open -na "Google Chrome" --args \
  --user-data-dir=/tmp/chatgpt-cdp-profile \
  --remote-debugging-port=9222

# 2. In that Chrome window: go to chatgpt.com, solve WAF once, log in

# 3. Add to .env:
CHATGPT_CDP_URL=http://localhost:9222
CHATGPT_WEB_HEADLESS=0

# 4. Restart the server — done. Chrome must stay open while server runs.

Why it works: connecting via CDP reuses the existing browser session. WAF sees a real, already-trusted browser — no new launch, no challenge.

Env vars:

CHATGPT_CDP_URL — if set, connects to existing Chrome via CDP (ignores all launch/profile settings)
CHATGPT_WEB_HEADLESS=0 — keep visible so the window stays accessible
CHATGPT_PROFILE_DIR — custom profile dir (default: ~/.local/share/keyless-eval/chatgpt)

Browser launch fallback order (when CDP not set):

/Applications/Google Chrome.app — real Chrome binary (most trusted by WAF)
/Applications/Chromium.app
/usr/bin/google-chrome[-stable]
channel="chrome" — Playwright's Chrome (may be "Chrome for Testing" on ARM Mac, WAF detects it)

Launch args: _get_stealth_args(headless) — minimal clean args on Mac visible mode (removes --disable-web-security, --use-gl=swiftshader, --disable-gpu which look suspicious to WAF on a machine with a real GPU and display).

chatgpt_web / gemini_web — Google Login in Docker

Google OAuth cannot be done inside headless Docker. Workflow:

# 1. Log in on Mac once (visible browser, CDP mode)
#    Follow the Mac CDP setup above, then copy the profile to Docker:

# 2. Sync the saved session profile to Docker volume (via helper container)
docker run --rm \
  -v /tmp/chatgpt-cdp-profile:/src:ro \
  -v keyless-evaluator_chatgpt-profile:/dst \
  alpine sh -c "cp -rf /src/. /dst/"

# 3. Restart container
docker compose restart

The container uses Xvfb (virtual display) + CHATGPT_WEB_HEADLESS=0 so Chromium runs "headed" on a fake screen — avoids WAF bot-detection that targets --headless mode.

Sessions expire after ~30 days. Re-run steps 1-2 to refresh the session without rebuilding the image.

Vercel Deployment

vercel env add GEMINI_API_KEY
vercel env add ALLOWED_ORIGINS   # comma-separated CORS origins
vercel deploy --prod

chatgpt_web (Playwright) cannot run on Vercel Lambda — use API providers only.

Common Issues

"LLM did not return valid JSON" — json-repair auto-fixes unescaped quotes; check logs/llm.log
chatgpt_web WAF block — use CDP connect mode (see above); fresh profiles always get challenged
chatgpt_web WAF block in Docker — ensure Xvfb is running and CHATGPT_WEB_HEADLESS=0
Playwright not found — uv run playwright install chromium
Gemini 429 — free tier: 15 req/min, 1500 req/day; add delay or upgrade tier
"Could not find result array" — set mapping.data_path to the array key name
max_results cap — default 20, max 500; set high (50-100) for chatgpt_web to save quota
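
For the Gemini 429 entry, "add delay" works out to 60 / 15 = 4 seconds between requests on the free tier. A minimal client-side throttle could look like the sketch below; `Throttle` is a hypothetical helper, it only spaces requests evenly, and it does not track the 1500 req/day limit.

```python
import time


class Throttle:
    """Space calls at least 60 / per_minute seconds apart."""

    def __init__(self, per_minute: int = 15, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_ok = 0.0  # earliest time the next call may start

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_ok - now)
        if delay:
            self._sleep(delay)
        self._next_ok = max(now, self._next_ok) + self.interval
        return delay
```

Call `throttle.wait()` immediately before each evaluation request; the injectable `clock` and `sleep` arguments exist so the behaviour can be tested without real waiting.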

Commit Message Rules

Co-author line: always use Co-Authored-By: AI IDE
Never use Co-Authored-By: Claude ... or any model-specific attribution