semantic-duplication-scan - SKILL.md Agent Skill

name: semantic-duplication-scan
description: >-
  Detect business logic reimplemented in multiple architectural layers. Builds a persistent computation-register.json by annotating non-trivial computation functions with structured semantic descriptions, then clusters entries to surface duplicate domain concepts. Runs in full-scan mode on first use, incremental (git-diff-based) mode on subsequent runs. Use when the user wants to find logical duplication that linters and diff-scoped review agents miss — the same domain calculation independently reimplemented across layers.
role: worker
user-invocable: true

Semantic Duplication Scan

Overview

Detect business logic that has been reimplemented multiple times across different architectural layers. Unlike linters (which detect syntactic similarity) or domain-review (which catches single-instance layer violations), this skill detects semantic equivalence — the same domain calculation independently appearing in domain services, client adapters, and presentation components with different variable names and structure.

Annotation Prompt Version

promptVersion: 1.0

When this version changes, any register entry with a different promptVersion is treated as stale and re-annotated on the next scan pass that touches that file.
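The staleness rule can be sketched as a small filter. This is a minimal illustration, assuming register entries are dictionaries with file and promptVersion keys; stale_files and its parameters are hypothetical names, not part of the skill.

```python
PROMPT_VERSION = "1.0"  # mirrors the promptVersion declared above


def stale_files(entries, touched_files, current=PROMPT_VERSION):
    """Return touched files holding at least one entry with a different promptVersion.

    Entries in files the scan pass does not touch are left alone, matching the
    rule that re-annotation happens only when a pass touches the file.
    """
    touched = set(touched_files)
    return sorted({
        entry["file"]
        for entry in entries
        if entry["file"] in touched and entry.get("promptVersion") != current
    })
```

Here a file with an outdated entry is reported only if the current pass touches it; untouched files keep their old entries until a later pass reaches them.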

Pre-Filter Rules

Apply these rules before any LLM call; this stage makes no model invocation.

Trivial Function Definition

A function is trivial — and must be excluded from the register — if it meets ALL of the following:

Contains no arithmetic operators: +, -, *, /, %, **
Contains no boolean logic operators: &&, ||, !, not, and, or
Contains no branching constructs: if, else, switch, case, ternary (?:), match
Contains no assignments to variables outside its own scope (no external state mutation)
Contains no calls to higher-order collection operations: map, filter, reduce, flatMap, forEach, find, some, every, or language equivalents

Trivial patterns (always excluded):

Getters: read and return a field with no transformation
Pass-through delegators: call one function with the same arguments, return the result unchanged
Identity functions: return the input unchanged
Constructors / initializers that only assign parameters to instance fields

If a file contains only trivial functions, output:

No computation units found to analyze

and do not create or modify the register.
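One way to approximate the Trivial Function Definition without a model call is a set of regular expressions over the function body. The sketch below is an assumption about how such a heuristic might look, not the skill's actual implementation. It deliberately over-matches (a "/" inside a comment or a "?" in optional chaining counts as a hit), which errs toward annotating a function rather than skipping it. It does not check the external-state-mutation rule, which needs scope analysis.

```python
import re

# Each pattern corresponds to one rule in the Trivial Function Definition.
ARITHMETIC = re.compile(r"\*\*|[+\-*/%]")
BOOLEAN = re.compile(r"&&|\|\||!|\b(?:not|and|or)\b")
BRANCHING = re.compile(r"\b(?:if|else|switch|case|match)\b|\?")
HIGHER_ORDER = re.compile(
    r"\.(?:map|filter|reduce|flatMap|forEach|find|some|every)\s*\("
)


def looks_trivial(body: str) -> bool:
    """Return True when the function body matches none of the patterns.

    `body` is the text between the function's braces (or its indented block),
    excluding the signature. External state mutation is NOT checked here.
    """
    patterns = (ARITHMETIC, BOOLEAN, BRANCHING, HIGHER_ORDER)
    return not any(pattern.search(body) for pattern in patterns)
```

A getter such as `return this.price;` passes as trivial, while `return price * (1 - rate);` does not.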

File Exclusion Patterns

Exclude the following from annotation regardless of content:

*.test.*
*.spec.*
__tests__/
*.test-d.*
*.generated.*
*.pb.*
*.d.ts
dist/
build/
.next/
coverage/

Also exclude any path matching a pattern listed in .semanticscanignore (one glob per line) if that file exists in the project root.
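The exclusion list could be applied roughly as follows. This sketch assumes that the directory patterns match a directory name at any depth and that .semanticscanignore globs follow fnmatch-style semantics; the source does not specify either, and is_excluded is a hypothetical helper name.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# File-name globs and directory names taken from the exclusion list above.
FILE_GLOBS = ["*.test.*", "*.spec.*", "*.test-d.*", "*.generated.*", "*.pb.*", "*.d.ts"]
DIR_NAMES = {"__tests__", "dist", "build", ".next", "coverage"}


def is_excluded(path: str, ignore_globs=()) -> bool:
    """Return True if a project-relative POSIX path should be skipped."""
    p = PurePosixPath(path)
    # Any excluded directory name among the parent segments excludes the file.
    if DIR_NAMES & set(p.parts[:-1]):
        return True
    # File-name globs are matched against the final path segment only.
    if any(fnmatchcase(p.name, glob) for glob in FILE_GLOBS):
        return True
    # Extra globs (e.g. read from .semanticscanignore) match the whole path.
    return any(fnmatchcase(path, glob) for glob in ignore_globs)
```

For example, `is_excluded("src/legacy/old.ts", ["src/legacy/*"])` is true only because of the extra glob, whereas `src/pricing/discount.test.ts` is excluded by the built-in list.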

Process Flow

Step 1 — Mode Detection

Check for computation-register.json in the project root:

Absent → full-scan mode
Present → incremental mode

Step 2 — Pre-Flight (Incremental Mode Only)

Run: git rev-parse --is-shallow-repository

If output is true:

Output the exact string: Shallow clone detected — semantic-scan requires full history for incremental mode. Run with --full to override.
Exit non-zero

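If the --full flag was passed: skip this check and force full-scan mode. Evaluate the flag before the shallow-clone check so that the override named in the message above takes effect.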

If lastScanCommit in the register is not found in git history:

Output: lastScanCommit not found in history — running full scan
Switch to full-scan mode
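Steps 1 and 2 together amount to a small decision table, sketched below. The four boolean inputs are assumed to be gathered by the caller (for instance, is_shallow from the git rev-parse command above). The function name and the reason labels are hypothetical, and the exact user-facing messages remain the ones specified in the text.

```python
def resolve_mode(register_exists, full_flag, is_shallow, commit_in_history):
    """Return (mode, reason) where mode is 'full', 'incremental', or 'abort'.

    reason is None unless the mode was forced by a pre-flight finding.
    """
    # --full and a missing register both lead straight to a full scan,
    # so the incremental-only pre-flight checks are skipped.
    if full_flag or not register_exists:
        return ("full", None)
    # Incremental mode needs full history; a shallow clone aborts the run.
    if is_shallow:
        return ("abort", "shallow-clone")
    # An unknown lastScanCommit falls back to a full scan.
    if not commit_in_history:
        return ("full", "last-scan-commit-missing")
    return ("incremental", None)
```

Placing the --full test first means a shallow clone never blocks an explicit full scan.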

Step 3 — Scope Resolution

If a path argument was provided (e.g., /semantic-scan src/pricing), use it as a prefix filter: only consider files whose paths start with the argument
Apply .semanticscanignore patterns: exclude any file matching a listed glob
Apply file exclusion patterns from the Pre-Filter Rules above

Step 4 — File Selection

Full-scan mode: Glob all source files in the resolved scope.

Incremental mode: Run git diff <lastScanCommit> HEAD --name-only, then filter to files in the resolved scope.

If the git diff result is empty (no files changed since lastScanCommit):

Update lastScanCommit to HEAD in the register and write the updated register
Output: No changes since last scan — register up to date
Exit 0 — do not proceed to annotation or clustering
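Incremental file selection reduces to filtering the diff output by the scope prefix. The sketch below assumes the output of the git diff command has already been captured as text; select_changed is a hypothetical name.

```python
def select_changed(diff_output: str, scope: str = "") -> list[str]:
    """Return changed paths that start with `scope`, sorted for stable output.

    `diff_output` is the raw text of `git diff <lastScanCommit> HEAD --name-only`,
    one path per line. An empty scope keeps every path.
    """
    files = [line.strip() for line in diff_output.splitlines() if line.strip()]
    return sorted(path for path in files if path.startswith(scope))
```

An empty result corresponds to the "No changes since last scan" branch. Because the filter is a plain string prefix, a scope of `src/pricing` also matches `src/pricing-v2/`; pass `src/pricing/` if only that directory is intended.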

Step 5 — Pre-Filter

For each selected file, identify non-trivial computation functions using the Trivial Function Definition above. Apply without an LLM call — use structural heuristics (presence of operators, branches, higher-order calls).

If no non-trivial functions remain after filtering:

First run: No computation units found to analyze → exit 0, no register created
Incremental run: No new computation units found in changed files — register unchanged → exit 0, register not modified

Step 6 — Annotation (Haiku, file-level batching)

For each file with non-trivial functions:

Emit progress to stderr: Annotating [N/total] <filename>
Send all non-trivial functions from the file in a single Haiku call using the pinned prompt below
If the call fails, record {file, error} in scanErrors and continue — do not abort
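The per-file loop with non-aborting error handling might look like the sketch below. The annotate callable stands in for the model call and is purely hypothetical; no real client API is assumed.

```python
import sys


def annotate_files(files, annotate):
    """Annotate each file in one batched call, collecting failures.

    `files` maps filename -> list of non-trivial function sources.
    `annotate(filename, functions)` is a hypothetical wrapper around the model
    call; it returns a list of annotation dicts or raises on failure.
    """
    entries, scan_errors = [], []
    total = len(files)
    for n, (filename, functions) in enumerate(files.items(), start=1):
        # Progress goes to stderr so stdout stays reserved for the report.
        print(f"Annotating [{n}/{total}] {filename}", file=sys.stderr)
        try:
            entries.extend(annotate(filename, functions))
        except Exception as exc:
            # Record the failure and keep going; one bad file must not abort the scan.
            scan_errors.append({"file": filename, "error": str(exc)})
    return entries, scan_errors
```

The returned scan_errors list is what later drives the partial-failure warning.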

Annotation prompt (pinned — do not paraphrase):

You are a semantic annotation assistant. For each function below, produce a JSON object describing what it computes in pure domain business terms.

Use this schema:
{
  "function": "<function name>",
  "layer": "<inferred layer — see rules below>",
  "semanticDescription": {
    "verb": "<lowercase infinitive verb>",
    "domainConcept": "<lowercase, no articles, normalized>",
    "inputs": ["<domain term>", ...],
    "outputConcept": "<domain term>"
  }
}

Layer inference rules — infer from what the function imports and uses:
- "infrastructure": imports DB clients, ORMs, HTTP clients, message brokers (pg, redis, axios, fetch, prisma, mongoose, etc.)
- "presentation": imports rendering primitives, formats for display, accesses DOM or templates (React, Vue, Svelte, JSX, HTML templates, etc.)
- "domain": depends only on domain types and pure functions, no external imports
- "application": orchestrates domain and infrastructure without owning business rules
- "unknown": cannot be determined from available context

domainConcept rules:
- Use lowercase
- Remove articles: a, an, the
- Normalize the verb to infinitive form
- Example: "calculates the discounted price" → domainConcept: "discounted price", verb: "calculate"

Describe only what the function computes in domain terms. Do not reference the implementation language, variable names, or data structure types.

Functions:
<paste function source here>

Canonicalize domainConcept and verb after receiving the response:

Lowercase domainConcept
Strip leading/trailing articles from domainConcept: a, an, the
Normalize the verb field to infinitive (e.g., "calculates" → "calculate", "computing" → "compute")
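A minimal sketch of the canonicalization pass follows. The article stripping follows the rule as written; the verb lookup table is a tiny illustrative stand-in containing only the two examples from the text, since the source does not say how infinitives are derived. A real implementation would likely need a lemmatizer.

```python
ARTICLES = {"a", "an", "the"}

# Illustrative only: just the two examples given in the text.
INFINITIVES = {"calculates": "calculate", "computing": "compute"}


def canonicalize_concept(text: str) -> str:
    """Lowercase and strip leading/trailing articles from a domainConcept."""
    words = text.lower().split()
    while words and words[0] in ARTICLES:
        words.pop(0)
    while words and words[-1] in ARTICLES:
        words.pop()
    return " ".join(words)


def canonicalize_verb(verb: str) -> str:
    """Lowercase the verb and map known inflections to the infinitive."""
    lowered = verb.strip().lower()
    return INFINITIVES.get(lowered, lowered)
```

Splitting on whitespace also collapses repeated spaces, so two responses that differ only in spacing or casing end up with identical register values.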

Step 7 — Register Update

Build a register entry for each annotated function:

{
  "file": "<relative path from project root>",
  "function": "<function name>",
  "layer": "<inferred layer>",
  "semanticDescription": {
    "verb": "<canonicalized>",
    "domainConcept": "<canonicalized>",
    "inputs": ["<domain term>", ...],
    "outputConcept": "<domain term>"
  },
  "promptVersion": "1.0",
  "commitHash": "<HEAD commit hash>",
  "line": <first line of function definition>
}

Merge strategy:

Replace all entries whose file matches a re-annotated file
Remove entries whose file no longer exists on disk
Remove entries for files matching .semanticscanignore patterns
Preserve all other entries unchanged

Idempotency: Sort the full entry list by file ascending, then function ascending before writing.
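The merge strategy and the sort can be sketched as one pure function. All parameter names here are hypothetical: the caller is assumed to supply the set of re-annotated files, the set of files still on disk, and a predicate for .semanticscanignore matches.

```python
def merge_register(existing, fresh, reannotated, on_disk, is_ignored):
    """Merge freshly annotated entries into the existing register entries.

    existing / fresh: lists of entry dicts with "file" and "function" keys.
    reannotated: set of files annotated in this run (their old entries are replaced).
    on_disk: set of files that still exist.
    is_ignored: predicate returning True for paths matching .semanticscanignore.
    """
    kept = [
        entry for entry in existing
        if entry["file"] not in reannotated   # replaced by fresh entries
        and entry["file"] in on_disk          # drop deleted files
        and not is_ignored(entry["file"])     # drop newly ignored paths
    ]
    # Sorting by (file, function) makes repeated runs write identical output.
    return sorted(kept + list(fresh), key=lambda e: (e["file"], e["function"]))
```

Because the result is always sorted the same way, running the scan twice with no changes yields a byte-identical entry list, which keeps register diffs in version control meaningful.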

Write the register. If the write fails (permissions, disk full):

Output the exact path that could not be written and the OS error
Exit non-zero

Update lastScanCommit to the current HEAD commit hash after successful write.

Report partial failures after writing, only if scanErrors is non-empty, where N is the number of files that failed:

N=1: Warning: 1 file could not be annotated. Re-run /semantic-scan to retry.
N>1: Warning: N files could not be annotated. Re-run /semantic-scan to retry.

Exit code 0 — partial success is not a failure.

Clustering

Token Budget and Partitioning

The full register may be too large to cluster in a single call, so shard it by layer pair before sending it to Sonnet:

domain × presentation
domain × infrastructure
application × presentation
application × infrastructure
domain × application

For each shard: if the shard exceeds 50,000 tokens, further shard by the first normalized token of domainConcept (split alphabetically into sub-groups).

After per-shard clustering, run a cross-shard reconciliation pass: send only the cluster representatives (one entry per cluster, the top-ranked candidate) back to Sonnet to catch equivalents that landed in different shards. Keep this merge input under 50,000 tokens.
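The two sharding levels can be sketched as follows. This is one reading of the rules, with assumptions: a shard contains every entry whose layer is either member of the pair, and the alphabetical split groups entries by the exact first token of domainConcept. How tokens are counted against the 50,000 limit is not specified in the source, so that check is left to the caller.

```python
# Layer pairs in the order listed above.
LAYER_PAIRS = [
    ("domain", "presentation"),
    ("domain", "infrastructure"),
    ("application", "presentation"),
    ("application", "infrastructure"),
    ("domain", "application"),
]


def shard_by_layer_pair(entries):
    """Return {(layer_a, layer_b): [entries whose layer is layer_a or layer_b]}."""
    return {
        pair: [entry for entry in entries if entry["layer"] in pair]
        for pair in LAYER_PAIRS
    }


def split_by_first_token(shard):
    """Group an oversized shard by the first token of domainConcept, sorted by token."""
    groups = {}
    for entry in shard:
        tokens = entry["semanticDescription"]["domainConcept"].split()
        key = tokens[0] if tokens else ""
        groups.setdefault(key, []).append(entry)
    return dict(sorted(groups.items()))
```

An entry appears in every shard whose pair includes its layer, so a domain entry is sent in three shards; the cross-shard reconciliation pass is what merges clusters that were found separately.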

Clustering Call (Sonnet)

For each shard, emit progress: Clustering [layer-pair]: <layer-a> × <layer-b>...

Clustering prompt (pinned — do not paraphrase):

Group these register entries by semantic equivalence — entries that compute the same domain concept regardless of implementation differences. Return clusters as JSON arrays of entry IDs (use "file:function" as the ID).

Two entries belong in the same cluster ONLY IF both would need to change if the underlying business rule changed.

Input entries:
<paste shard entries as JSON array>

Return only:
{ "clusters": [["file1:funcA", "file2:funcB"], ...] }

Entries that are semantically unique should not appear in any cluster.

Canonical Scoring

For each cluster with 2 or more entries:

Layer rank, listed from most canonical to least canonical:

domain
application
presentation
infrastructure
unknown

Within the same layer rank: count infrastructure imports in the function's file; fewer imports = higher rank.

Ambiguity predicate: If the top two candidates tie on layer rank AND differ by ≤1 infrastructure import → escalate to Opus.
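The ranking and the ambiguity predicate can be sketched together. The infra_imports mapping (file to infrastructure import count) is assumed to be computed elsewhere, and the function names are hypothetical. Entries are assumed to carry one of the five known layer values.

```python
# Most canonical first, as listed above.
LAYER_ORDER = ["domain", "application", "presentation", "infrastructure", "unknown"]


def rank_candidates(cluster, infra_imports):
    """Sort cluster entries by layer rank, then by fewer infrastructure imports."""
    return sorted(
        cluster,
        key=lambda e: (LAYER_ORDER.index(e["layer"]), infra_imports[e["file"]]),
    )


def is_ambiguous(ranked, infra_imports):
    """True when the top two share a layer and differ by at most one import."""
    if len(ranked) < 2:
        return False
    first, second = ranked[0], ranked[1]
    same_layer = first["layer"] == second["layer"]
    import_gap = abs(infra_imports[first["file"]] - infra_imports[second["file"]])
    return same_layer and import_gap <= 1
```

When is_ambiguous returns True, the cluster goes to the escalation path described next; otherwise the first ranked entry is the suggested canonical.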

Before the Opus call, emit: Resolving ambiguous canonical for cluster: <domainConcept>...

Opus prompt:

Given these N entries that all compute the same domain concept, which one is the most appropriate canonical location for the business rule? Consider: domain purity, reusability, and least coupling to delivery mechanism. Return the file:function ID of the preferred canonical and a one-sentence rationale.

Entries:
<paste cluster entries>

If --no-opus flag was passed: skip Opus escalation. Report ambiguous clusters as:

canonical: ambiguous — human review required

Canonical Verdict Output

Clear winner (domain or application layer, no tie): canonical: suggested <file:line> — requires human confirmation
No winner (all infrastructure/unknown, or tie unresolved): canonical: none — a new domain-layer implementation may be required

Cross-scope notice (scoped runs only): If a cluster contains entries outside the scoped path:

Note: this cluster includes <N> entry/entries outside the scoped path — run without scope argument to see full context

(Use "entry" when N=1, "entries" when N>1.)

Report

After clustering is complete:

Duplicates found:

## Semantic Duplication Report

### Cluster: <domainConcept>

  - <file>:<line> [<layer>]  ← inferred canonical (or all entries if no canonical)
  - <file>:<line> [<layer>]

  canonical: suggested <file:line> — requires human confirmation
  [Note: this cluster includes N entry/entries outside the scoped path ...]

---

No duplicates:

No semantic duplication detected

file:line accuracy: Line numbers point to the first line of the function definition. If the file has been modified since annotation, append: (line may have shifted — re-run scan to refresh)