code-quality-assessment - SKILL.md Agent Skill

name: code-quality-assessment
description: Quantitative and qualitative code quality assessment with prioritized refactoring recommendations

REPORT_FILE = specs/architecture/code_quality_assessment.md

Target directories: If the target directory for REPORT_FILE does not exist, create it. The assessment is the first artifact that justifies the directory's existence.

Recommended Tools: Make sure you've read ~/.liza/AGENT_TOOLS.md. list_directory_tree and codebase_search (fast and token-efficient semantic search) may be especially useful.

Quality is not a binary. Measure it, grade it, and direct investment where it will compound.

Invoked for: periodic health checks, pre-refactoring assessment, onboarding orientation, or explicit code quality evaluation.

Process

Metrics Collection → Subsystem Analysis → Synthesis → Recommendations

Templates anchor cognition. Complete each phase before the next. The skill is a measurement framework to apply to what you found, not boxes to fill with platitudes.

Report format: See references/report-format.md for the output template.

Modes

Mode	When	Scope
Full Assessment	First assessment, periodic health check, major milestone	All phases, all sections, full report
Targeted Assessment	Evaluate specific subsystem(s) or concern area	Scoped metrics + analysis for named components only
Reassessment	After refactoring or significant changes	Delta comparison against previous REPORT_FILE
Enrichment	Improve coverage of existing assessment	Independent analysis → merge → verify → update
Quick Health Check	Verify existing findings still hold	Metrics refresh + finding verification only

Phase applicability:

Mode	Phase 1 (Metrics)	Phase 2 (Subsystem Analysis)	Phase 3 (Synthesis)	Phase 4 (Recs)	Output
Full Assessment	✓ Complete	✓ Complete	✓ Complete	✓ Complete	New REPORT_FILE
Targeted Assessment	Scoped	Scoped	Scoped	Scoped	Targeted section in REPORT_FILE
Reassessment	✓ Fresh	Delta comparison	Update	Update	Revised REPORT_FILE
Enrichment	✓ Fresh (independent)	Update (add to existing)	Update	Update	Revised REPORT_FILE
Quick Health Check	Refresh only	Verify only	— Skip	— Skip	Updated metrics + verification notes

Mode selection (first match wins):

User explicitly requests Targeted → Targeted Assessment
User explicitly requests Reassessment → Reassessment
User explicitly requests Quick Health Check → Quick Health Check
REPORT_FILE exists → Enrichment
No existing report → Full Assessment
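
The first-match-wins rules above can be sketched as a small function. This is a minimal illustration, not part of the skill: `select_mode` and its parameters are hypothetical names, and it does not check that Reassessment has an existing REPORT_FILE to compare against.

```python
from typing import Optional

# Explicit requests are checked first, in the order the rules list them.
EXPLICIT_MODES = ("Targeted Assessment", "Reassessment", "Quick Health Check")


def select_mode(requested: Optional[str], report_exists: bool) -> str:
    """Return the assessment mode; the first matching rule wins."""
    if requested in EXPLICIT_MODES:
        return requested
    # No explicit request: fall back on whether REPORT_FILE already exists.
    return "Enrichment" if report_exists else "Full Assessment"
```

Because the explicit requests come first, an existing report only triggers Enrichment when the user asked for nothing specific.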

Full Assessment

Use complete process: Phase 1 → Phase 2 → Phase 3 → Phase 4.

Time Budget: Phase 1 (Metrics) ~30% of effort. Phase 2 (Subsystem Analysis) ~40%. Phases 3+4 (Synthesis + Recommendations) ~30%. Most missed findings come from rushed metrics collection — especially the File Size Distribution scan. If you're tempted to skip ahead, you're under-investing in discovery.

Pairing checkpoint: After Phase 1, present the Metrics Dashboard and identified subsystems before proceeding to Phase 2. This catches scope gaps early (missed languages, wrong LOC counts, missing subsystems).

Default output: REPORT_FILE, used when no output path is specified and the file does not exist yet.

Targeted Assessment

Scope to named subsystems only. Collect metrics only for the targeted components.

Scope: Confirm which subsystems/directories to assess
Metrics: Collect Phase 1 metrics scoped to targeted paths only
Analysis: Full Phase 2 for targeted subsystems only
Output: If REPORT_FILE exists, update targeted sections within it. If not, create a partial report covering only assessed subsystems.

Reassessment

Requires existing REPORT_FILE.

Collect fresh metrics (full Phase 1)
For each subsystem, compare current metrics against previous
For each previous finding, verify: still relevant? severity changed? resolved?
Update ratings and grade
Highlight what changed. Tag changes: *(reassessment YYYY-MM-DD)*
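
A rough sketch of the delta comparison and the change tag, assuming metrics are held as simple name-to-number mappings. The function names and the metric keys in the usage note are illustrative only.

```python
from datetime import date
from typing import Dict


def metric_deltas(previous: Dict[str, float], current: Dict[str, float]) -> Dict[str, float]:
    """Return current minus previous for every metric present in both runs."""
    return {name: current[name] - previous[name]
            for name in previous if name in current}


def reassessment_tag(day: date) -> str:
    """Format the tag that marks a change made during reassessment."""
    return f"*(reassessment {day.isoformat()})*"
```

For example, `metric_deltas({"prod_loc": 1000}, {"prod_loc": 1200})` gives `{"prod_loc": 200}`; metrics that appear in only one run are left for manual review.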

Enrichment

Same anti-anchoring protocol as the software-architecture-review skill.

First check: Verify REPORT_FILE exists. If it doesn't, this is Full Assessment, not Enrichment.

Header check (BEFORE discovery): Read REPORT_FILE until you find the Mode: line to extract:

Pass number (e.g., "Mode: Enrichment (pass 3)")
Previous lens from header

Your pass is N+1. Your lens continues the rotation from the previous lens.

⚠️ STOP as soon as you have the Mode line. Do NOT scroll down or read findings.

⚠️ CRITICAL: You MUST NOT read REPORT_FILE findings until Step 2. Reading findings early causes anchoring — you'll confirm existing findings instead of discovering new ones.
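
One way to enforce the stop-at-header rule mechanically is to read the report line by line and return as soon as the Mode line is seen. The sketch below assumes the header looks like "Mode: Enrichment (pass 3, Dependencies lens)"; the source only shows "(pass 3)", so the lens position is an assumption, and `read_mode_header` is a hypothetical helper.

```python
import re
from typing import Iterable, Optional, Tuple

# Assumed header shape: "Mode: Enrichment (pass 3, Dependencies lens)".
# The pass and lens parts are optional because the source only shows "(pass 3)".
MODE_RE = re.compile(
    r"Mode:\s*(?P<mode>[^(]+?)\s*"
    r"(?:\(pass\s+(?P<n>\d+)(?:,\s*(?P<lens>[^)]+?)\s+lens)?\))?\s*$"
)


def read_mode_header(lines: Iterable[str]) -> Optional[Tuple[str, Optional[int], Optional[str]]]:
    """Consume lines only until the Mode: line, then stop reading."""
    for line in lines:
        match = MODE_RE.search(line.strip())
        if match:
            n = match.group("n")
            return match.group("mode"), int(n) if n else None, match.group("lens")
    return None  # no Mode line found
```

Passing an open file handle (an iterator over lines) means nothing after the header is read into memory, which keeps the findings out of view until the merge phase.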

Process:

Independent Analysis (Phase 1 + Phase 2) — Complete as if no report exists. Explore the codebase fresh. Hold findings in memory. Do not read the existing report.
Merge Phase — Only now read REPORT_FILE. Compare your fresh findings against it.
Verification — For each finding in the existing report, verify:
- Still accurate? (code may have changed)
- Still relevant? (context may have shifted)
- Severity still appropriate?
- Mark each as: ✓ verified, ✗ stale, or ~ adjusted
Gap Analysis — List:
- New findings from fresh pass not in existing report
- Existing findings your fresh pass missed (and why)
Update — Revise REPORT_FILE with:
- New findings added, tagged *(pass N)* or *(pass N, [lens] lens)*
- Stale findings removed or marked resolved
- Severity adjustments where warranted
- Updated date and metrics in header

Time Budget: Independent Analysis (step 1) should be at least as thorough as merge + verification combined.

Lens Rotation

Each enrichment pass uses a different primary lens. Continue from the previous pass's lens.

Lenses:

Complexity — LOC, function length, god classes/scripts, cyclomatic complexity, design-level complexity patterns
Test Coverage — Test gaps, untested critical paths, test quality patterns
Dependencies — Dependency count, staleness, vendoring, transitive depth
Documentation — Doc coverage, staleness risk, spec-code drift
CI/Build — Pipeline completeness, enforcement gaps, build hygiene

Rotation order: Complexity → Dependencies → CI/Build → Test Coverage → Documentation → (wrap to Complexity)

The first 3 passes cover the highest-value lenses (Complexity, Dependencies, CI/Build) as primary.
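
The rotation is a fixed cycle, so the next lens follows directly from the previous one. A minimal sketch (the function name is illustrative):

```python
# Rotation order as stated above; the list wraps back to Complexity.
ROTATION = ["Complexity", "Dependencies", "CI/Build", "Test Coverage", "Documentation"]


def next_lens(previous: str) -> str:
    """Return the lens that follows `previous`, wrapping at the end."""
    index = ROTATION.index(previous)  # raises ValueError for an unknown lens
    return ROTATION[(index + 1) % len(ROTATION)]
```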

How to apply: During Phase 1+2, start with your primary lens. Spend ~40% of discovery time on it before broadening. The leading lens gets deepest attention while context is fresh.

Complexity lens — systematic scan: When Complexity is your primary lens:

Structural scan (start here): Run the LOC scan from Phase 1.3. Flag ALL files >500 LOC as potential god classes. For each, investigate before merge phase.
Design-level scan (after structural): For each complex function or file identified, ask "why is this complex?" before recommending how to fix it:
- Boolean-flag dispatch: Function resolves a type/variant into boolean flags, then threads them through multiple phases. Fix: strategy/polymorphism, not extraction.
- Imperative ceremony: Repetitive code that follows a pattern but isn't expressed as data. Fix: declarative definitions + loop, not file split.
- Responsibility mixing: File contains unrelated concerns sharing no state. Fix: file split (this is the one case where extraction IS the right answer).
- Deep nesting: Complex conditionals that could be flattened. Fix: early returns, guard clauses, or table-driven dispatch.
The structural scan finds where complexity lives. The design-level scan identifies what kind of complexity it is. Different kinds need different remedies — recommending extraction for a design problem addresses the symptom, not the cause.
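
To make the boolean-flag dispatch pattern concrete, here is a small invented example (an exporter with three output variants, not taken from any assessed codebase). The first function resolves the variant into flags and re-checks them in each phase; the second expresses each variant once as data, a light form of the strategy fix.

```python
from dataclasses import dataclass
from typing import Dict, List


# Before: the variant is resolved into flags that every phase must re-check.
def export_with_flags(kind: str, rows: List[List[str]]) -> str:
    is_csv = kind == "csv"
    is_tsv = kind == "tsv"
    sep = "," if is_csv else "\t" if is_tsv else " "
    header = "# csv" if is_csv else "# tsv" if is_tsv else "# text"
    return header + "\n" + "\n".join(sep.join(r) for r in rows)


# After: each variant is defined once; phases read its fields.
@dataclass(frozen=True)
class Format:
    header: str
    sep: str


FORMATS: Dict[str, Format] = {
    "csv": Format("# csv", ","),
    "tsv": Format("# tsv", "\t"),
    "text": Format("# text", " "),
}


def export(kind: str, rows: List[List[str]]) -> str:
    fmt = FORMATS[kind]  # unknown kinds fail loudly with KeyError
    return fmt.header + "\n" + "\n".join(fmt.sep.join(r) for r in rows)
```

Extracting helper functions from the first version would shorten it but leave the flags in place; the second version removes them, which is the distinction the design-level scan is meant to surface.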

Enrichment Iteration Guidance

Recommended: Run enrichment 3 times for solid coverage. Additional passes provide diminishing returns.

⚠️ MANDATORY after 3+ passes: If pass number ≥ 3, present options before proceeding:

Pass [N] exists ([previous lens] lens). Per skill, 3 passes provide solid coverage.

Options:
1. Pass [N+1] Enrichment ([next lens] lens) — full independent discovery + merge
2. Reassessment — fresh metrics + delta comparison
3. Quick Health Check — verify existing findings still hold

Which approach?

Quick Health Check

Fastest mode. No new discovery — verification only.

Refresh Phase 1 metrics
Compare against REPORT_FILE metrics — note any significant changes
For each existing finding, verify against current codebase state
Mark findings as: ✓ verified, ✗ stale, ~ adjusted
Update header date and metrics
Do NOT explore for new findings

Phase 1: Metrics Collection

Quantitative backbone of the assessment. Language-agnostic.

1.1 Language Detection

Scan for manifest files to determine primary language(s):

Manifest	Language
`go.mod`	Go
`package.json`	JavaScript/TypeScript
`pyproject.toml`, `requirements.txt`, `setup.py`	Python
`Cargo.toml`	Rust
`pom.xml`, `build.gradle`, `build.gradle.kts`	Java/Kotlin
`.csproj`, `.sln`	C#/.NET
`Gemfile`	Ruby
`mix.exs`	Elixir

Multi-language projects: collect metrics per language, report the primary language first.
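
The manifest table can be applied with a short directory walk. This sketch is an approximation: it does not skip vendored or dependency directories (a `package.json` under `node_modules/` would count), and `detect_languages` is a hypothetical helper.

```python
from pathlib import Path
from typing import List

# Manifest name or suffix -> language, in the order of the table above.
MANIFESTS = [
    ("go.mod", "Go"),
    ("package.json", "JavaScript/TypeScript"),
    ("pyproject.toml", "Python"),
    ("requirements.txt", "Python"),
    ("setup.py", "Python"),
    ("Cargo.toml", "Rust"),
    ("pom.xml", "Java/Kotlin"),
    ("build.gradle", "Java/Kotlin"),
    ("build.gradle.kts", "Java/Kotlin"),
    (".csproj", "C#/.NET"),
    (".sln", "C#/.NET"),
    ("Gemfile", "Ruby"),
    ("mix.exs", "Elixir"),
]


def detect_languages(root: str) -> List[str]:
    """List languages whose manifest appears anywhere under `root`."""
    names = [p.name for p in Path(root).rglob("*") if p.is_file()]
    found: List[str] = []
    for manifest, language in MANIFESTS:
        is_suffix = manifest.startswith(".")  # .csproj and .sln are extensions
        hit = any(n.endswith(manifest) if is_suffix else n == manifest for n in names)
        if hit and language not in found:
            found.append(language)
    return found
```

The result lists every detected language; deciding which one is primary still needs the LOC figures from 1.2.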

1.2 Repository Overview

Collect per language (prefer cloc, scc, or tokei when available; fall back to wc -l):

Production code: LOC by language (exclude tests, vendored, generated)
Test code: LOC across test files
Test-to-production ratio: test LOC / production LOC
Test count: number of test functions/cases
Documentation: LOC across docs/specs/README files
Dependencies: direct count from manifest files

Language-specific collection hints:

Language	Production LOC	Test Files	Test Count	Dependencies
Go	`*.go` excluding `*_test.go`	`*_test.go`	`rg "func Test"`	`require` blocks in go.mod
Python	`*.py` excluding `test_*`, `*_test.py`	`test_*.py`, `*_test.py`	`rg "def test_"`	pyproject.toml / requirements.txt
TS/JS	`*.ts`, `*.js` excluding `*.test.*`, `*.spec.*`, `node_modules/`	`*.test.*`, `*.spec.*`	`rg 'it\(|test\('`	package.json
Rust	`*.rs` excluding `tests/`	`tests/`, `#[cfg(test)]` modules	`rg '#\[test\]'`	`[dependencies]` in Cargo.toml

These are approximations for order-of-magnitude assessment, not precision tooling.

1.3 File Size Distribution

Identify files exceeding thresholds. List top 20 largest files:

# grep -v ' total$' drops the summary line(s) wc prints when given several files

# Go
find . -name "*.go" ! -name "*_test.go" ! -path "*/vendor/*" -exec wc -l {} + | grep -v ' total$' | sort -rn | head -20

# Python
find . -name "*.py" ! -path "*/__pycache__/*" ! -path "*/venv/*" ! -path "*/.venv/*" -exec wc -l {} + | grep -v ' total$' | sort -rn | head -20

# TypeScript/JavaScript
find . \( -name "*.ts" -o -name "*.tsx" -o -name "*.js" -o -name "*.jsx" \) ! -path "*/node_modules/*" -print0 | xargs -0 wc -l | grep -v ' total$' | sort -rn | head -20

# Rust
find . -name "*.rs" ! -path "*/target/*" -exec wc -l {} + | grep -v ' total$' | sort -rn | head -20

# Or use cloc/scc/tokei if available — they handle exclusions automatically

Flag files >500 LOC as potential god classes/scripts
Flag files >300 LOC as candidates for investigation
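
Where shell tools are unavailable, the same scan and the two thresholds can be approximated in a few lines. This is a sketch under simplifying assumptions: it counts every line (as `wc -l` does), takes a single file suffix, and applies none of the path exclusions used in the commands above.

```python
from pathlib import Path
from typing import List, Tuple

GOD_CLASS_LOC = 500    # ">500 LOC": potential god class/script
INVESTIGATE_LOC = 300  # ">300 LOC": candidate for investigation


def classify_size(loc: int) -> str:
    """Map a line count to the flag the thresholds assign."""
    if loc > GOD_CLASS_LOC:
        return "potential god class"
    if loc > INVESTIGATE_LOC:
        return "investigate"
    return "ok"


def largest_files(root: str, suffix: str, limit: int = 20) -> List[Tuple[int, str, str]]:
    """Return (line count, path, flag) for the largest files, biggest first."""
    rows = []
    for path in Path(root).rglob(f"*{suffix}"):
        if path.is_file():
            with path.open(encoding="utf-8", errors="replace") as handle:
                loc = sum(1 for _ in handle)
            rows.append((loc, str(path), classify_size(loc)))
    return sorted(rows, reverse=True)[:limit]
```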

1.4 Dependency Assessment

Count direct vs transitive dependencies (where tooling supports it)
Note dependency minimalism or bloat relative to ecosystem norms
Check for known problematic patterns: vendored copies of actively-maintained deps, pinned ancient versions

1.5 CI/CD Assessment

Check for CI config: .github/workflows/, .gitlab-ci.yml, Jenkinsfile, Makefile, etc.
Check for pre-commit: .pre-commit-config.yaml
Note what CI covers: lint, test, build, coverage, deploy
Check for coverage enforcement (threshold gates, not just reporting)

1.6 Code Hygiene Scan

Scan for patterns that indicate quality discipline or gaps:

Magic literals: Search for hardcoded string values used in control flow, event dispatch, identity comparison, or configuration. Language-specific patterns:

Language	Scan Approach
Go	`rg -n '"[a-z_]+"' -g '*.go' -g '!*_test.go'` filtered to control flow contexts
Python	`rg -n "['\"]\w+['\"]" -g '*.py'` in if/match/dispatch contexts
TS/JS	`rg -n "['\"]\w+['\"]" -g '*.ts'` in switch/if/event contexts

Not every string literal is a magic value. Focus on:

Strings used in dispatch (switch/if chains, event names, status checks)
Strings used as identity (agent IDs, role names, service names)
Strings that appear in multiple files (cross-module coupling via literal)
Strings that shadow typed constants (the type exists but literals bypass it)

Provenance classification — for each category of magic literal found, classify:

Category	Example	Fix
System constant	Event name, error code	Extract to typed constant
Configuration value	Default port, timeout	Extract to config with default
User-supplied identity	Agent ID, workspace name	Resolve from runtime state — a constant doesn't fix this

The third category is the most severe: a hardcoded value that should be dynamic means the system silently assumes a specific runtime configuration. A const only consolidates the assumption; it doesn't fix it.

Suppression markers: Count nolint, noqa, @ts-ignore, # type: ignore, eslint-disable
Panic/exit calls: panic(), os.Exit(), process.exit(), sys.exit() in non-main code
Untyped escape hatches: interface{}, any, Any, object in production code
TODO/FIXME/HACK: Count and assess whether tracked or abandoned
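
Counting suppression markers is a plain pattern count. The sketch below works on one file's text; the regular expressions are deliberately loose guesses at how each marker usually appears in comments, not exact linter syntax.

```python
import re
from collections import Counter

# Markers listed in the hygiene scan; patterns are deliberately loose.
SUPPRESSIONS = {
    "nolint": re.compile(r"//\s*nolint"),
    "noqa": re.compile(r"#\s*noqa"),
    "@ts-ignore": re.compile(r"@ts-ignore"),
    "type: ignore": re.compile(r"#\s*type:\s*ignore"),
    "eslint-disable": re.compile(r"eslint-disable"),
}


def count_suppressions(source: str) -> Counter:
    """Count each suppression marker in one file's text."""
    counts: Counter = Counter()
    for name, pattern in SUPPRESSIONS.items():
        counts[name] = len(pattern.findall(source))
    return counts
```

Summing the counters across files gives the repository total; a high count concentrated in one subsystem is usually more informative than the total alone.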

1.7 Metrics Dashboard

Assemble findings into the dashboard format from references/report-format.md.

Phase 2: Subsystem Analysis

Qualitative assessment of each component.

2.1 Subsystem Identification

Walk the project structure. Identify natural subsystem boundaries:

Top-level directories with distinct purposes
Package/module boundaries
Layers (domain, application, infrastructure, presentation) if identifiable

In Pairing mode: confirm identified subsystems before proceeding.

2.2 Rating Framework

Each subsystem gets a 1–5 star rating:

Stars	Meaning
★★★★★	Exemplary — clean, well-tested, minimal concerns
★★★★☆	Strong — solid engineering with minor concerns
★★★☆☆	Adequate — functional but meaningful improvement opportunities
★★☆☆☆	Concerning — significant issues affecting maintainability or reliability
★☆☆☆☆	Critical — serious problems requiring immediate attention

Rating dimensions (weight equally unless context justifies otherwise):

Code organization: Cohesion, responsibility separation, file sizes
Test coverage: Quality and depth of testing for this subsystem
API clarity: Are interfaces and contracts clear?
Error handling: How are failures managed?
Complexity management: Is complexity proportional to the problem? When complexity is high, is it structural (file/function size — fixable by splitting) or design-level (wrong abstraction, boolean-flag dispatch, imperative ceremony — requires pattern change)? Rate lower when complexity has a design root cause, even if LOC is acceptable.
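
Equal weighting of the five dimensions can be sketched as a simple average. The per-dimension 1 to 5 scores and the round-half-up rule are assumptions made for illustration; the skill itself leaves the aggregation to judgment.

```python
from statistics import mean
from typing import Dict

DIMENSIONS = ("code organization", "test coverage", "API clarity",
              "error handling", "complexity management")


def star_rating(scores: Dict[str, int]) -> str:
    """Average five 1-5 dimension scores equally and render as stars."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    average = mean(scores[d] for d in DIMENSIONS)
    stars = min(5, max(1, int(average + 0.5)))  # round half up (assumption)
    return "★" * stars + "☆" * (5 - stars)
```

A computed rating is a starting point only: a design-level complexity problem may justify a lower rating than the average suggests.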

2.3 Analysis Template

For each subsystem:

### [Subsystem Name] (`path/`) ★★★★☆

**Strengths:**
- [Specific strength with evidence]

**Concerns:**
- [Specific concern with evidence — file names, LOC counts, concrete observations]

Discipline: Every strength and concern must cite evidence. "Well-tested" is not a strength — "2.5:1 test-to-production ratio with table-driven subtests" is. "Large file" is not a concern — "handlers.go at 918 LOC mixing 30+ handler functions" is.

2.4 Cross-Cutting Sections

After subsystem analysis, assess these quality dimensions that span subsystems. Include each section only when there is substantive content — an empty section is worse than an absent one.

Testing & Quality Infrastructure:

Test framework and patterns (stdlib vs third-party, table-driven, fixtures, etc.)
Coverage approach (reporting, enforcement, thresholds)
Test quality enforcement (parallel tests, race detection, sleep guards, test helper isolation)
Integration/E2E test presence and coverage

Pre-Commit & CI Pipeline:

Hook coverage by category (lint, format, type-check, security, file hygiene)
CI stages and platforms
Enforcement gaps (what runs but doesn't block, what doesn't run at all)

Documentation & Specifications:

Doc volume and coverage
Spec depth (if specs/ exists)
Staleness risk (specs referencing outdated implementation details)
Onboarding path clarity

Phase 3: Synthesis

Aggregate findings into overall assessment.

3.1 Executive Summary

One paragraph: what the project is and its overall engineering quality.

Key Strengths (3–5 bullet points): The most impactful positive patterns. Synthesize — don't enumerate. "Clean architecture" is not a strength; "Strict layer separation with dependencies pointing inward — no infrastructure types leak into domain" is.

Areas for Improvement (3–5 bullet points): The most impactful concerns. Same evidence discipline.

3.2 Overall Grade

Grade	Meaning
A+	Exceptional — exemplary across all dimensions; teaching reference quality
A	Excellent — strong across all dimensions; concerns are minor
A-	Excellent with concerns — strong foundation, meaningful structural or coverage gaps
B+	Good — solid engineering; several areas need attention
B	Good with gaps — functional and maintainable; notable testing or structural gaps
B-	Adequate — works but shows systematic underinvestment in quality
C+	Below expectations — multiple significant concerns; improvement needed before scaling
C	Concerning — serious quality issues affecting reliability or maintainability
C-	Poor — widespread problems; refactoring required before feature work
D	Critical — fundamental issues; quality debt threatens project viability
F	Failing — unmaintainable; rebuild considerations warranted

Grading discipline:

Grade must be justified with specific reference to subsystem ratings and metrics
State the deduction from the next-higher grade explicitly (e.g., "The deduction from A to A- is for file-level concentration in the MCP handlers and state model")
Calibrate to ecosystem norms (see Reference: Language Ecosystem Norms)

Phase 4: Recommendations

Prioritized refactoring roadmap.

4.1 Priority Framework

Priority	Criteria	Typical Actions
P1: High Impact / Low Risk	Structural improvements that don't change behavior. Clear, safe, high ROI.	File splits, module extraction, grouping, adding missing CI gates, extracting typed constants from magic literals
P2: Medium Impact / Medium Risk	Quality improvements requiring broader changes.	Coverage enforcement, test additions, API cleanup, dependency updates, design pattern introduction (strategy, declarative registration), resolving hardcoded identities
P3: Strategic / Long-term	Investments that compound over time. May require architecture changes.	Fuzz testing, spec-code automation, tooling, major decompositions

4.2 Recommendation Template

For each recommendation:

#### N.M [Title]
- **What**: [Specific files/components to change and how]
- **Risk**: [Low / Medium / High] — [rationale]
- **Impact**: [What improves and why it matters]
- **Depends on**: [Other recommendations, if any]

Every recommendation must trace to a finding in Phase 2 or Phase 3. No generic "best practices" without project-specific justification.

4.3 Recommendation Anti-Patterns

Do not recommend what the project already does well
Do not recommend architectural rewrites when structural cleanup suffices
Do not recommend without stating the concrete problem it addresses
Do not recommend file splits or method extraction as the remedy for every complexity concern. Ask "why is this complex?" first — if the root cause is a design issue (boolean-flag dispatch, imperative ceremony, missing polymorphism), recommend the design-level fix. Extraction addresses structural complexity; it does not fix design complexity. If you find yourself recommending only structural remedies, you are likely missing design-level concerns.
Sequence matters: P1 should be achievable independently; P2 may depend on P1

Persistence of Findings

ISSUES_FILE = specs/architecture/architectural-issues.md

Significant findings (subsystem concerns rated ★★☆☆☆ or below, P1 recommendations, cross-cutting concerns) should be persisted to ISSUES_FILE for long-term tracking.

Persistence format:

### [Issue Title]

**Skill:** code-quality-assessment
**Category:** [Subsystem concern / Cross-cutting / RECOMMENDATION]

**Issue:** [Description]

**Implication:** [Why it matters]

**Direction:** [Suggested approach, if any]

What to persist:

Subsystem concerns with ★★☆☆☆ or below
All P1 recommendations
Cross-cutting concerns with systemic impact

What NOT to persist:

Low-priority style issues
Findings already in ISSUES_FILE (check before adding)
Transient issues resolved in the same session
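
The persist and do-not-persist rules combine into one predicate. In this sketch the `Finding` fields are hypothetical, and "already in ISSUES_FILE" is reduced to an exact title match, which is cruder than a real duplicate check would need to be.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Finding:
    title: str
    stars: Optional[int] = None     # subsystem rating, if a subsystem concern
    priority: Optional[str] = None  # "P1", "P2", "P3" for recommendations
    systemic: bool = False          # cross-cutting concern with systemic impact
    resolved_in_session: bool = False


def should_persist(finding: Finding, existing_titles: Set[str]) -> bool:
    """Apply the persist / do-not-persist rules to one finding."""
    if finding.title in existing_titles or finding.resolved_in_session:
        return False
    low_rated = finding.stars is not None and finding.stars <= 2
    return low_rated or finding.priority == "P1" or finding.systemic
```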

Scope Constraints

This skill assesses the whole repository, not individual diffs. Scope constraints apply to persistence (what gets written to ISSUES_FILE), not to analysis (what gets examined).

Liza mode (multi-agent):

Full repo assessment is the default — the skill exists for whole-codebase evaluation
When invoked during a review task (not a standalone assessment), scope findings to what's relevant to the review context
Persist only findings not already in ISSUES_FILE

Pairing mode:

Do not re-raise issues already documented in ISSUES_FILE unless materially changed
If changes worsen a documented issue, update the existing entry rather than duplicating

Mode-Specific Confirmation

Pairing mode: Before saving findings to ISSUES_FILE:

Found [N] quality issues worth persisting:
1. [Issue title] — [one-line summary]
2. ...

Save to specs/architecture/architectural-issues.md? (y/n/select specific)

Wait for user confirmation.

Liza mode: Save automatically after assessment completion.

Integration

Skill	Relationship
software-architecture-review	Complementary. Code quality assesses health metrics and grading; architecture review assesses structural patterns, smells, and dependency direction. For deeper structural analysis, invoke architecture review.
testing	Downstream. Testing skill provides detailed test methodology; code quality assessment provides the bird's-eye view of testing adequacy.
clean-code	Downstream. Code quality assessment identifies refactoring targets; clean-code executes the transformations.
code-review	Orthogonal. Code review evaluates diffs; code quality evaluates the whole codebase. Assessment findings provide context for reviewers.

Reference: Metrics Calibration

Healthy ranges for calibrating assessments. These are norms, not targets.

Metric	Healthy Range	Warning Signs
Test-to-production ratio	0.5:1 – 3:1	<0.3:1 (undertested), >5:1 (possibly testing implementation details)
Max file LOC	<500	>500 (god class candidate), >800 (almost certainly needs splitting)
Direct dependencies	Varies by ecosystem	>50 for a focused tool; >200 for any project
CI coverage enforcement	Present	Absent when test ratio is healthy (culture without enforcement)
TODOs in production code	0 ideal	>10 untracked (deferred maintenance)
Pre-commit hooks	Present	None configured in a team project
Magic literals in dispatch	0 (all typed constants)	>5 untyped strings in control flow (typo risk, no IDE support)
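
As a worked instance of the first row, the test-to-production ratio can be placed against the table like this. The table leaves the bands between 0.3 and 0.5 and between 3 and 5 unlabeled; calling them "borderline" is an assumption of this sketch.

```python
def classify_test_ratio(test_loc: int, production_loc: int) -> str:
    """Place a test-to-production ratio against the calibration table."""
    if production_loc <= 0:
        raise ValueError("production_loc must be positive")
    ratio = test_loc / production_loc
    if ratio < 0.3:
        return "undertested"
    if ratio > 5:
        return "possibly testing implementation details"
    if 0.5 <= ratio <= 3:
        return "healthy"
    return "borderline"  # between the healthy range and the warning signs
```

For example, 2,500 test LOC against 1,000 production LOC is a 2.5:1 ratio and falls in the healthy range.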

Reference: Language Ecosystem Norms

Calibration varies by ecosystem. Anchoring to wrong norms produces misleading grades.

Language	Typical Test Approach	Dependency Norms	File Size Norms
Go	Table-driven `t.Run()` subtests; stdlib testing	Minimal (stdlib-first)	Packages <500 LOC typical
Python	pytest; fixtures-heavy	pip ecosystem; moderate deps normal	Modules <300 LOC typical
TypeScript/JS	Jest/Vitest; mock-heavy	npm ecosystem; high dep count normal	Components <200 LOC typical
Rust	`#[test]` modules; property testing via proptest	Cargo ecosystem; moderate deps	Modules <500 LOC typical
Java/Kotlin	JUnit; Spring test	Maven/Gradle; high deps normal	Classes <300 LOC typical
C#/.NET	xUnit/NUnit; mock-heavy	NuGet; moderate deps	Classes <300 LOC typical
Ruby	RSpec/Minitest; fixtures	Bundler; moderate deps	Classes <200 LOC typical
Elixir	ExUnit; doctests	Hex; minimal deps typical	Modules <300 LOC typical

A Go project with 50 dependencies is notable; a TypeScript project with 50 is unremarkable. Grade relative to ecosystem.