benchmark-translate - SKILL.md Agent Skill

name: benchmark-translate description: Run a quality benchmark of the /translate skill by selecting stratified test keys, capturing ground truth, translating, judging with sub-agents, and compiling a regression report. Invoke with /benchmark-translate. allowed-tools: Read, Write, Edit, Grep, Glob, Bash(node ), Bash(git checkout), Bash(git diff), Bash(git status), Bash(git rev-parse*), Task, Skill, AskUserQuestion

Translation Quality Benchmark

Measures the quality of the /translate skill by comparing its output against existing human translations. Uses stratified key selection with a fixed/rotating split, LLM judges, and programmatic validation to produce a comprehensive quality report with regression tracking across all 9 supported locales.

Data Artifacts

All benchmark data lives in scripts/translations/benchmark/ (gitignored):

File	Purpose
`testKeys.json`	Selected test keys with categories and `fixed` flag
`coreKeys.json`	Persistent core key set (stable across runs)
`ground-truth.json`	Captured human translations before removal
`report.json`	Latest benchmark report (becomes baseline on next run)
`baseline.json`	Previous report (auto-copied by setup.js)

Pipeline (7 Steps)

Step 1: Select Keys

node .claude/skills/benchmark-translate/scripts/select-keys.js [--count N] [--core N]

Selects N keys (default 150) stratified across 6 categories: glossary-term, financial-error, single-word, interpolation, defi-jargon, general. Validates all selected keys exist in en + all 9 locales.

Fixed/rotating split:

--core N (default 100): Number of fixed core keys for stable regression tracking
--count N (default 150): Total keys (core + rotating)
If coreKeys.json exists: loads it, validates keys still exist in all locales, tops up if needed
If coreKeys.json doesn't exist: selects core keys via stratified sampling and saves them
Remaining keys (default 50) are randomly selected as rotating keys from the non-core pool
Each entry in testKeys.json has "fixed": true (core) or "fixed": false (rotating)

Outputs scripts/translations/benchmark/testKeys.json.

Step 2: Setup

node .claude/skills/benchmark-translate/scripts/setup.js

If report.json exists from a previous run, copies it to baseline.json
Reads testKeys.json, captures ground truth translations for all 9 locales
Writes ground-truth.json
Removes test keys from locale files so /translate can regenerate them

Step 3: Translate

Invoke the /translate skill using the Skill tool. This regenerates the removed keys through the full translate-review-refine pipeline.

Step 4: Judge (Sub-Agents)

Launch 9 sub-agents in 3 waves of 3 (matching /translate's wave structure) using the Task tool. Each sub-agent receives the locale info, all key triplets, and glossary terms.

Wave 1: de, es, fr Wave 2: pt, ru, tr Wave 3: ja, uk, zh

For each locale, use this prompt:

You are an expert multilingual localization quality assessor for a cryptocurrency/DeFi application.
Rate translations from English into {LANGUAGE_NAME} on a 1-5 scale.

1 = Wrong/misleading meaning
2 = Significant issues (wrong register, missing nuance)
3 = Acceptable but could be more natural
4 = Good, natural, accurate
5 = Excellent, indistinguishable from professional native translation

Check: meaning preservation, naturalness, register ({REGISTER}), UI conciseness,
glossary compliance (these stay English: {NEVER_TRANSLATE_TERMS}),
placeholder integrity (%{...} preserved), DeFi terminology conventions.

Rate each translation INDEPENDENTLY. Community translations can contain errors.

Input: JSON array of {key, english, human, skill}
{ITEMS_JSON}

Output: Return ONLY a JSON array of objects with these exact fields:
{key, humanScore, skillScore, humanJustification, skillJustification, preferenceNote}

Scores must be integers 1-5. Justifications should be 1-2 sentences. preferenceNote should say which is better and why, or "tie" if equal.

Locale info for prompt substitution:

Locale	Language	Register
`de`	German	Formal (Sie)
`es`	Spanish	Informal (tú)
`fr`	French	Formal (vous)
`ja`	Japanese	Polite (です/ます)
`pt`	Portuguese	Informal (você)
`ru`	Russian	Formal (вы)
`tr`	Turkish	Formal (siz)
`uk`	Ukrainian	Formal (ви)
`zh`	Chinese (Simplified)	Neutral/formal

Building the items array for each locale:

Read scripts/translations/benchmark/ground-truth.json
Read the current (post-translate) src/assets/translations/{locale}/main.json
For each test key, build: { key: dottedPath, english: groundTruth.english[key], human: groundTruth.groundTruth[locale][key], skill: getValueFromLocaleFile(key) }

Getting never-translate terms: Read src/assets/translations/glossary.json, collect all keys where value is null (excluding _meta).

Each sub-agent must write its output to /tmp/{locale}-judge-scores.json. Parse the JSON array from the sub-agent's response and write it to that path.

Step 5: Compile Report

node .claude/skills/benchmark-translate/scripts/compile-report.js

Loads judge scores from /tmp/{locale}-judge-scores.json, runs programmatic validation (including Cyrillic script check for ru/uk), computes summary stats, and writes scripts/translations/benchmark/report.json. If baseline.json exists, includes regression deltas. Report includes coreSummary and rotatingSummary alongside the overall summary.

Step 6: Restore

node .claude/skills/benchmark-translate/scripts/restore.js

Restores locale files via git checkout --, verifies no diff remains.

Step 7: Present Results

Read the compile output (printed to stdout) and present to the user:

Overall score summary with baseline regression (if available)
Core vs rotating stats (divergence suggests overfitting)
Notable improvements/regressions
Per-locale and per-category highlights
Any items needing attention (low scores, validation failures, glossary issues)