name: tool-reviewer description: Audit Agenvoy tool definitions (built-in Go tools, extensions/apis/.json, extensions/scripts//tool.json) against the project's tool design rules under the lazy-schema model — name clarity, description trigger coverage, schema field completeness, English-only text, and explicit defaults on optional fields. Use when the user wants to review tool quality, check tool design compliance, or audit api_/script_ extensions.
Tool Reviewer
Audits all Agenvoy tool definitions against the four design rules in project CLAUDE.md and emits a violation report.
Sources Audited
| Source | Path | Tool name prefix |
|---|---|---|
| Built-in | internal/tools/**/*.go (any toolRegister.Regist block) |
(varies) |
| API extensions | extensions/apis/*.json |
api_* |
| Script extensions | extensions/scripts/*/tool.json |
script_* |
Rules (mirrors CLAUDE.md "Tool 註冊" contract — lazy-schema model)
Lazy-schema context: every tool's name + description is always in LLM context; the parameters JSON schema is replaced with a stub {"type":"object","properties":{}} unless AlwaysLoad=true. The LLM decides whether to invoke a tool (or call search_tools to load its schema) using name + description alone. Schema is the call contract, loaded on demand.
- Name is self-explanatory — verb + noun, direct, distinguishable from siblings (e.g.
search_chat_history≻search_history). Name is the anchor; description elaborates the trigger. - Description is three signals, each one line — this is the LLM's only signal before the schema loads; bloat dilutes the trigger.
- Line 1 — What: what the tool does (core action, one sentence)
- Line 2 — When: when to use it vs alternatives (
use for X; Y for Z) - Line 3 — Precondition: key constraint or prerequisite the caller must satisfy before invoking (omit if none) Target length 60–200 chars total. What to avoid (filler dilutes selection signal):
- Padding phrases ("This tool allows you to...", "Use this when needed...")
**bold**/ markdown emphasis (token waste; structure ≠ semantics)- Output schema dumps (belongs in
parameters/ response examples, not selection text) - Implementation gossip (
uses readability under the hood) unrelated to selection - Call-contract details (type/unit/enum/default) — those belong in
parameters[*].description - Per-use-case bullet lists when a single sentence covers them Going past three lines usually means schema content leaked in.
- Parameter description is three signals — every entry in
parameters.propertiesmust carry adescriptioncovering:- How: what it controls, type, unit, accepted values (enum/regex/range)
- When: interaction with other params (omit if standalone)
- Example: at least one concrete value for non-trivial types (cron expression, file path, JSON body)
- English only —
description,parameters[*].description,enumtext must be English. CJK / mixed-language is a violation. (User-facing handler return strings may stay in Chinese.) - Optional fields require explicit
default— every parameter not inrequired[]must declare"default": <value>so the LLM knows the omission semantics. Required fields must NOT carrydefault.
Command Syntax
/tool-reviewer [PROJECT_PATH] [OUTPUT_FILE]
| Parameter | Default | Description |
|---|---|---|
PROJECT_PATH |
Current directory | Agenvoy repo root |
OUTPUT_FILE |
.doc/tool-reviewer/{yyyy-MM-dd_HH-mm}.md |
Report output (relative to PROJECT_PATH) |
Examples
/tool-reviewer # → .doc/tool-reviewer/2026-04-25_14-30.md
/tool-reviewer . # same
/tool-reviewer . custom.md # explicit override
Workflow
1. Scan → python3 scripts/scan_tools.py {PROJECT_PATH}
outputs JSON:
{ tools: [{source, name, description, parameters, required, file, line}, ...],
deterministic_violations: [{tool, rule, detail}, ...],
name_clusters: { <first_token>: [<tool>, ...] } ← anchor for R1 sibling review
}
2.A R1 sweep → Walk EVERY tool returned by the scan and write a one-line R1 verdict
(`pass` or `fail` + suggested rename). This step is mandatory — there is
no "skip if name looks fine" branch. Use `name_clusters` to compare
each tool against its siblings (same first token). Failing patterns:
• Generic verb (`process`, `handle`, `dispatch`, `verify`, ...)
• Verb inconsistency with siblings in the same cluster
(e.g. `analyze_X` when every other cluster member is `fetch_*`)
• Verb redundancy (`patch_edit` — `patch` already implies edit)
• Description carries semantic load that should be in the name
(e.g. `verify` whose description says "cross-review with external agents"
→ rename to `cross_review_with_external_agents`)
• Inconsistent suffix vocabulary across a cluster
(e.g. `read_tool_error` / `remember_error` / `search_error_history`
— same domain, three different shapes)
Verdicts are emitted in the report's `## Name Audit` section (see output_format.md).
2.B R2 sweep → Same enumeration discipline for description three-signal check. Re-read each
description and verify it has: What (core action), When (vs alternatives),
Precondition (if applicable). Flag as R2 violation if under-spec (missing
lines) or over-spec (>3 lines, filler, schema dumps). Suggest rewrite.
2.C R3 sweep → For each tool, walk every `parameters.properties` entry. Verify
three-signal structure: How (type/unit/values), When (param interaction),
Example (concrete value). Missing/single-word/name-repeat description,
non-trivial type without example → flag as R3 violation.
3. Gate → if zero deterministic + zero heuristic violations across all tools, skip Save
and print a one-line "no issues" message. Honor explicit OUTPUT_FILE override.
The Name Audit section must still be produced inside the report when one is
written, even if all verdicts are `pass` — coverage > brevity.
4. Save → mkdir -p {PROJECT_PATH}/.doc/tool-reviewer/ then write the report.
Deterministic Checks (handled by scripts/scan_tools.py)
The script flags these without LLM judgment — the LLM only needs to confirm and add context:
| Check | Trigger |
|---|---|
R1_DYNAMIC_NAME |
Name: field is a Go identifier the parser could not resolve to a same-file const literal |
R1_MIXED_SEPARATOR |
Tool name contains both _ and - (Agenvoy convention is snake_case) |
R1_GENERIC_VERB |
Name's first token is a generic verb (do, process, handle, manage, execute, perform, dispatch, run); see GENERIC_VERB_WHITELIST for justified exceptions |
R2_SHORT_DESC |
Description shorter than 60 characters — too thin to convey trigger signals under the lazy-schema model |
R2_BOLD_MARKDOWN |
Description contains **...** or __...__ (token waste; structure ≠ semantics) |
R3_PARAM_NO_DESC |
A properties entry has no description field (or empty/whitespace-only) |
R3_PARAM_SHORT_DESC |
Parameter description shorter than 20 characters AND type is non-trivial (object, array, or has enum) — likely missing examples/constraints |
R4_NON_ENGLISH_DESCRIPTION |
Tool description contains any CJK / Hangul / Hiragana / Katakana codepoint |
R4_NON_ENGLISH_PARAM |
Parameter description contains any CJK codepoint |
R5_OPTIONAL_NO_DEFAULT |
Parameter is not in required[] AND has no default key |
R5_REQUIRED_HAS_DEFAULT |
Parameter is in required[] AND has a default key (semantically wrong) |
Note: rules R2_NUMBERED_TRIGGER, R2_MULTI_PARAGRAPH, R2_TOOL_COMPARISON from earlier versions have been removed — trigger enumeration, multi-paragraph trigger context, and tool comparisons are now expected description content, not violations.
The scanner also emits name_clusters (tools grouped by first token after stripping api_ / script_ prefix) so the LLM-side R1 sweep has a concrete sibling list per cluster.
Heuristic Checks (LLM judgment)
For every tool the script returns — no skipping — apply these checks. Coverage is enforced by the Validation Checklist below: every tool must appear with a verdict in the report's ## Name Audit section.
- Name quality (R1): would another LLM, seeing only the name, correctly choose this tool over its siblings? Use
name_clustersfrom the scan output as the comparison anchor — same first token = sibling group. Suggest a better name on fail. Failure modes:- Generic verb the deterministic checker missed (e.g.
verify,query,inspectwhen not specific enough) - Verb inconsistency within a cluster (one tool uses
analyze_*while every sibling usesfetch_*) - Verb redundancy (
patch_edit,delete_remove_*) - Name buries the discriminator in description (
verifywhose description reveals it actually meanscross_review_with_external_agents) - Inconsistent suffix vocabulary across same-domain tools (
read_tool_error/remember_error/search_error_history— pick one shape)
- Generic verb the deterministic checker missed (e.g.
- Description three-signal check (R2): verify each description follows the three-line structure — What (core action) / When (vs alternatives) / Precondition (if any). Flag as R2 violation when:
- Under-spec: action-only ("Lists files") with no When line; missing contrast when sibling tools exist; missing precondition when one exists
- Over-spec: more than three lines; paragraph prose or bullet enumerations; filler ("This tool allows you to..."); output format dumps or call-contract details that belong in
parametersPropose a rewrite (60-200 chars, three lines max) for every fail.
- Parameter three-signal check (R3): walk every
propertiesentry. Verify each has: How (type/unit/values), When (param interaction, if applicable), Example (concrete value for non-trivial types). Flag as R3 violation when:descriptionmissing, single-word, or just repeats field name- Non-trivial type without a concrete example
enumlisted without per-value meaning- Numeric field without unit; param interaction undocumented
Reference Files
| Step | File | Purpose |
|---|---|---|
| Evaluate | scripts/review_rules.md |
Full rule definitions with positive / negative examples |
| Save | scripts/output_format.md |
Report structure template |
Validation Checklist
- All three sources scanned (built-in / api_ / script_)
- Every deterministic violation in the JSON appears in the report
- Name Audit section present and lists EVERY tool (count must equal
summary.tool_count); each row has an explicitpass/fail+suggestionverdict — no tool may be silently omitted - Every tool also received a description trigger-coverage review (R2) and a parameter schema completeness review (R3) — failures land in the per-source detail sections, passes are implied by absence
-
name_clustersfrom the scan output were consulted (cite at least one cluster comparison in any R1 fail entry) - Suggestions are concrete (proposed new name, proposed trimmed description), not abstract advice
- No-Op gate respected — if zero violations and no explicit
OUTPUT_FILE, skip the file (the gate covers detail sections; if a report IS written, the Name Audit section is mandatory) - Report grouped by source → severity → tool, not flat