tech-architecture-review - SKILL.md Agent Skill

name: tech-architecture-review
description: >
  Technology architecture assessment across the "-ilities" -- scalability, reliability, maintainability, security, observability, performance, and extensibility. USE THIS SKILL when the user asks about architecture reviews, scalability assessments, tech debt quantification, system design evaluation, modernization planning, cloud architecture review, or infrastructure assessment. Includes system decomposition, failure mode analysis, technical debt classification, and remediation roadmap generation with prioritized investments.

Technology Architecture Review

Required Inputs

System/Platform Name: What is being reviewed.
Review Scope: Full architecture, specific subsystem, or specific concern (scalability, security, etc.).
Business Context: Growth plans, performance requirements, regulatory constraints.
Available Documentation: Architecture diagrams, runbooks, ADRs, monitoring dashboards.
Access Level: Documentation only, read-only system access, or full environment access.
Success Criteria: What "good" looks like for this organization (SLAs, growth targets, compliance).

Execution Steps

1. System Decomposition Analysis

Map the entire system before assessing individual qualities.

Component Inventory:

| Component | Type | Owner | Technology | Deployment | Criticality |
|---|---|---|---|---|---|
| | Service / Library / Database / Queue / Cache / CDN / Gateway | Team | Language, framework, version | Cloud/on-prem, containerized/VM | Critical / High / Medium / Low |

Dependency Mapping:

| Source Component | Target Component | Dependency Type | Communication | Failure Impact |
|---|---|---|---|---|
| | | Synchronous / Asynchronous / Data | REST / gRPC / Event / DB / File | Cascading / Degraded / Isolated |

Dependency Risk Classification:

Circular dependencies: Components with bidirectional synchronous calls (always Significant Risk)
Critical chain length: Longest synchronous call chain (>4 hops = performance and reliability risk)
Single points of failure: Components whose failure cascades to >50% of the system (Deal Breaker if no redundancy)
External dependencies: Third-party services without fallback (risk proportional to SLA gap)
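The first three checks above can be sketched over a dependency map held as plain tuples. This is a minimal illustration, not a prescribed tool: the component names, the `DEPENDENCIES` list, and all function names are hypothetical, and the cascade estimate assumes that a failure propagates only upstream through synchronous callers.

```python
from collections import defaultdict

# Hypothetical dependency map: (source, target, dependency type).
DEPENDENCIES = [
    ("web", "orders", "sync"),
    ("orders", "inventory", "sync"),
    ("inventory", "orders", "sync"),  # bidirectional synchronous call
    ("orders", "billing", "sync"),
    ("billing", "ledger", "sync"),
    ("orders", "email", "async"),
]

def sync_graph(deps):
    """Adjacency sets for synchronous dependencies only."""
    graph = defaultdict(set)
    for source, target, kind in deps:
        if kind == "sync":
            graph[source].add(target)
    return graph

def all_components(deps):
    return {s for s, _, _ in deps} | {t for _, t, _ in deps}

def bidirectional_sync_pairs(deps):
    """Pairs of components that call each other synchronously."""
    graph = sync_graph(deps)
    pairs = set()
    for source, targets in graph.items():
        for target in targets:
            if source in graph.get(target, ()):
                pairs.add(tuple(sorted((source, target))))
    return sorted(pairs)

def longest_sync_chain(deps):
    """Longest simple synchronous call path (exhaustive search, small graphs only)."""
    graph = sync_graph(deps)
    best = []

    def walk(node, path):
        nonlocal best
        if len(path) > len(best):
            best = list(path)
        for nxt in sorted(graph.get(node, ())):
            if nxt not in path:
                walk(nxt, path + [nxt])

    for start in sorted(all_components(deps)):
        walk(start, [start])
    return best

def cascade_share(deps, component):
    """Fraction of the other components that reach `component` synchronously."""
    reverse = defaultdict(set)
    for source, targets in sync_graph(deps).items():
        for target in targets:
            reverse[target].add(source)
    seen, stack = set(), [component]
    while stack:
        for caller in reverse.get(stack.pop(), ()):
            if caller != component and caller not in seen:
                seen.add(caller)
                stack.append(caller)
    return len(seen) / (len(all_components(deps)) - 1)

chain = longest_sync_chain(DEPENDENCIES)
print("Circular pairs:", bidirectional_sync_pairs(DEPENDENCIES))
print("Longest chain hops:", len(chain) - 1, "(flag if > 4)")
print("Cascade share for ledger:", cascade_share(DEPENDENCIES, "ledger"))
```

A cascade share above 0.5 for a component without redundancy would correspond to the Deal Breaker condition in the list; the exhaustive path search is only practical for inventories of modest size.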

System Boundary Diagram Requirements:

Data flow direction and volume (requests/sec, GB/day)
Authentication and authorization boundaries
Network boundaries (VPC, subnet, public/private)
Data classification zones (public, internal, confidential, restricted)

2. Architecture Assessment: The "-ilities"

Score each dimension on the 1-5 maturity scale with specific evidence.
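One possible way to roll the seven dimension scores up into a single health figure is an unweighted mean. The skill does not prescribe an aggregation method, so both the mean and the round-down mapping to a level are assumptions in this sketch, and the example scores are hypothetical.

```python
import math

LEVELS = {1: "Ad Hoc", 2: "Managed", 3: "Defined", 4: "Quantified", 5: "Optimizing"}

def overall_health(scores):
    """Return (mean score, level number, level name) for 1-5 dimension scores."""
    for name, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{name}: score {score} is outside the 1-5 scale")
    mean = round(sum(scores.values()) / len(scores), 1)
    level = math.floor(mean)  # assumption: a level counts only once fully reached
    return mean, level, LEVELS[level]

# Hypothetical scores for the seven dimensions.
example = {
    "scalability": 3.0,
    "reliability": 2.5,
    "maintainability": 3.5,
    "security": 2.0,
    "observability": 3.0,
    "performance": 4.0,
    "extensibility": 3.0,
}
print(overall_health(example))
```

A weighted mean may suit organizations where one dimension (for example security in a regulated environment) dominates; the weights would then need to be agreed with stakeholders up front.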

2a. Scalability Assessment

Scaling Model Analysis:

Dimension	Current State	Assessment
Scaling direction	Vertical only / Horizontal / Auto-scaling
Statelessness	Stateful servers / Session affinity / Fully stateless
Database scaling	Single instance / Read replicas / Sharding / Distributed
Async processing	Synchronous only / Some queues / Event-driven / Full CQRS
Caching strategy	No caching / Application cache / Distributed cache / Multi-layer
CDN/Edge	No CDN / Static assets / Dynamic content / Edge compute

Capacity Planning Matrix:

Resource	Current Usage	Current Capacity	Headroom	2x Load	5x Load	10x Load	Bottleneck?
CPU (compute)
Memory
Database connections
Database IOPS
Network bandwidth
Storage
API rate limits (external)
Message queue throughput
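The matrix can be filled in mechanically once usage and capacity are measured. The sketch below assumes resource usage grows linearly with load, which is a simplification (contention and caching effects often break linearity), and all figures and names in `RESOURCES` are hypothetical.

```python
# Hypothetical measurements: resource -> (current usage, current capacity).
RESOURCES = {
    "cpu_cores": (40, 128),
    "db_connections": (180, 500),
    "db_iops": (6000, 16000),
    "queue_msgs_per_sec": (900, 12000),
}

def capacity_matrix(resources, multipliers=(2, 5, 10)):
    """Headroom multiple and pass/fail per load multiplier, assuming linear scaling."""
    rows = {}
    for name, (usage, capacity) in resources.items():
        row = {"headroom": round(capacity / usage, 2)}
        for m in multipliers:
            row[f"{m}x"] = "ok" if usage * m <= capacity else "exceeded"
        rows[name] = row
    return rows

def first_bottleneck(resources):
    """Resource with the smallest headroom multiple; it saturates first."""
    return min(resources, key=lambda name: resources[name][1] / resources[name][0])

for name, row in capacity_matrix(RESOURCES).items():
    print(name, row)
print("First bottleneck:", first_bottleneck(RESOURCES))
```

Load-test results should replace the linear assumption wherever they are available, since the real saturation point is usually lower than the arithmetic suggests.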

Scalability Score Rubric:

Score	Characteristics
1 - Ad Hoc	Single server, vertical scaling only, no capacity planning, manual intervention required
2 - Managed	Some horizontal scaling, basic load balancing, reactive capacity management
3 - Defined	Auto-scaling configured, stateless services, read replicas, capacity monitoring
4 - Quantified	Load-tested regularly, predictive scaling, database sharding, multi-region capable
5 - Optimizing	Auto-scaling with cost optimization, edge computing, elastic everything, chaos-tested

2b. Reliability Assessment

SLA/SLO Evaluation:

| Service | Current SLO | Achieved (12 months) | Target SLO | Gap | Business Impact of Downtime |
|---|---|---|---|---|---|
| | 99.X% | 99.X% | 99.X% | | $/hour or impact description |
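To make the Gap column concrete, an availability percentage can be converted into the downtime it permits. This is a small sketch with hypothetical function names; it assumes availability is measured over calendar time (365 days by default) rather than over request counts.

```python
def allowed_downtime_minutes(slo_percent, days=365):
    """Downtime budget implied by an availability SLO over `days` days."""
    return (1 - slo_percent / 100) * days * 24 * 60

def slo_gap(achieved_percent, target_percent):
    """Positive gap means achieved availability fell short of the target."""
    return round(target_percent - achieved_percent, 3)

for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% -> {allowed_downtime_minutes(slo):.1f} minutes per year")
print("Gap (percentage points):", slo_gap(99.5, 99.9))
```

Multiplying the excess downtime by the hourly business impact from the last column gives a rough cost of the gap.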

Failure Mode Analysis:

| Component | Failure Mode | Probability | Impact | Detection | Recovery | Risk Score |
|---|---|---|---|---|---|---|
| | Crash / Slowdown / Data corruption / Network partition | H/M/L | H/M/L | Seconds / Minutes / Hours | Auto / Manual / None | P x I |
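The P x I risk score needs numeric values behind the H/M/L ratings. The mapping below (L=1, M=2, H=3) is an assumption, not something the table specifies, and the failure modes listed are hypothetical examples.

```python
# Assumed numeric mapping for the H/M/L ratings.
LEVEL = {"L": 1, "M": 2, "H": 3}

def risk_score(probability, impact):
    """Probability x Impact on a 1-9 scale."""
    return LEVEL[probability] * LEVEL[impact]

# Hypothetical failure modes: (component, failure mode, probability, impact).
failure_modes = [
    ("orders-db", "Data corruption", "L", "H"),
    ("api-gateway", "Slowdown", "H", "M"),
    ("cache", "Crash", "M", "L"),
]

ranked = sorted(failure_modes, key=lambda fm: risk_score(fm[2], fm[3]), reverse=True)
for component, mode, p, i in ranked:
    print(f"{component:12} {mode:16} P={p} I={i} score={risk_score(p, i)}")
```

Detection time and recovery mode are not part of the score here; a reviewer may want to use them as tie-breakers when two failure modes score the same.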

Disaster Recovery Assessment:

Metric	Target	Actual	Gap	Finding Severity
Recovery Point Objective (RPO)
Recovery Time Objective (RTO)
Last DR test date	Quarterly
DR test success rate	100%
Backup verification	Daily
Cross-region replication	Active
Runbook completeness	100% of critical services

Reliability Score Rubric:

Score	Characteristics
1 - Ad Hoc	No SLOs, no redundancy, manual recovery, no backups tested
2 - Managed	Basic monitoring, some redundancy, backups exist but rarely tested
3 - Defined	SLOs defined, N+1 redundancy, automated failover for some services, regular backups
4 - Quantified	Error budgets tracked, multi-AZ, automated recovery, regular DR testing
5 - Optimizing	Chaos engineering, multi-region active-active, self-healing, zero-downtime deployments

2c. Maintainability Assessment

Dimension	Assessment Criteria	Score (1-5)	Evidence
Code clarity	Consistent style, naming conventions, low complexity
Documentation	Architecture docs, API docs, runbooks, onboarding guides
Test coverage	Unit, integration, E2E coverage; test quality, not just quantity
Dependency management	Dependency freshness, update cadence, vulnerability patching
Deployment process	CI/CD maturity, rollback capability, deployment frequency
Modularity	Separation of concerns, bounded contexts, loose coupling
Onboarding time	Time for new developer to make first meaningful contribution

Onboarding Time Benchmark:

Maturity	Time to First Commit	Time to Independent Contribution
Level 1	>4 weeks	>3 months
Level 3	1-2 weeks	1 month
Level 5	<1 week	<2 weeks

2d. Security Architecture Review

Defense in Depth Assessment:

Layer	Controls	Status	Finding
Perimeter	WAF, DDoS protection, rate limiting, IP allowlisting
Network	VPC segmentation, security groups, private subnets, VPN/bastion
Identity	SSO, MFA, OAuth2/OIDC, service mesh identity
Application	Input validation, output encoding, CSRF, secure headers
Data	Encryption at rest (AES-256), in transit (TLS 1.2+), key management
Monitoring	Security event logging, SIEM integration, threat detection
Response	Incident response plan, forensic capability, breach playbooks

Zero Trust Evaluation:

Principle	Implementation Status	Score (1-5)
Never trust, always verify	Mutual TLS, token validation at every service
Least privilege access	RBAC with minimal permissions, just-in-time access
Assume breach	Micro-segmentation, blast radius limitation, lateral movement detection
Explicit verification	Device health, user identity, location, behavior all checked
Continuous validation	Session re-evaluation, continuous authentication signals

Threat Modeling (STRIDE per component):

| Component | Spoofing | Tampering | Repudiation | Info Disclosure | Denial of Service | Elevation of Privilege |
|---|---|---|---|---|---|---|
| | Risk / Mitigated | Risk / Mitigated | Risk / Mitigated | Risk / Mitigated | Risk / Mitigated | Risk / Mitigated |

2e. Observability Assessment

Capability	Level 1: Ad Hoc	Level 3: Defined	Level 5: Optimizing
Logging	Console output, no aggregation	Centralized logging, structured format	Contextual logging, correlation IDs, anomaly detection
Metrics	Basic infra metrics (CPU, memory)	Application metrics, dashboards, alerts	Business metrics, SLO tracking, predictive alerting
Tracing	No distributed tracing	Tracing for some services	Full distributed tracing, trace-based testing
Alerting	Manual checks or basic uptime	Threshold alerts, on-call rotation	Intelligent alerting, runbook automation, low noise
Dashboards	None or ad hoc	Service-level dashboards	Unified observability platform, business and tech dashboards

2f. Performance Assessment

Performance Analysis Methodology:

Metric	Measurement Method	Current (p50 / p95 / p99)	Target	Status
API response time	APM / synthetic monitoring
Page load time	RUM / Lighthouse
Database query time	Query profiler / slow query log
Background job duration	Job monitoring
Throughput (req/sec)	Load balancer metrics
Error rate	Application metrics
Resource utilization	Infrastructure monitoring
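When an APM tool is not available, the p50 / p95 / p99 columns can be derived from raw latency samples. The sketch below uses Python's `statistics.quantiles` with the inclusive method; the sample data, the target value, and the function names are hypothetical.

```python
import statistics

def latency_percentiles(samples_ms):
    """p50, p95, and p99 from raw latency samples in milliseconds."""
    # n=100 yields 99 cut points; index k-1 holds the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

def status(percentiles, target_p95_ms):
    """Compare the measured p95 against a target."""
    return "Meets target" if percentiles["p95"] <= target_p95_ms else "Misses target"

# Hypothetical samples: 1 ms through 100 ms, one request each.
samples = list(range(1, 101))
result = latency_percentiles(samples)
print(result, status(result, target_p95_ms=120))
```

Percentiles computed from small sample sets are noisy, especially p99, so the sample window should be stated next to any figure entered in the table.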

Performance Anti-Pattern Detection:

Anti-Pattern	Detection Method	Impact	Remediation
N+1 queries	Query profiling, ORM analysis	Database overload, slow responses	Eager loading, query optimization
Synchronous external calls	Trace analysis, call graphs	Cascading latency, timeouts	Async patterns, circuit breakers
Missing indexes	Query explain plans, slow query log	Full table scans, high CPU	Index analysis and creation
Unbounded queries	Code review, query analysis	Memory exhaustion, timeouts	Pagination, query limits
Large payload transfers	Network analysis, API inspection	Bandwidth waste, slow responses	Compression, pagination, field selection
Missing caching	Cache hit ratio analysis	Redundant computation/queries	Cache strategy implementation
Connection pool exhaustion	Connection monitoring	Service unavailability	Pool sizing, connection management

2g. Extensibility Assessment

Dimension	Assessment Criteria	Score (1-5)	Evidence
API design	Versioning strategy, backward compatibility, documentation
Plugin/extension architecture	Ability to add functionality without core changes
Configuration management	Feature flags, environment-based config, runtime changes
Event system	Ability to react to system events without coupling
Multi-tenancy	Support for tenant isolation, customization per tenant

3. Technical Debt Classification and Quantification

Technical Debt Inventory:

| ID | Debt Item | Category | Type | Impact (1-5) | Effort (1-5) | Priority Score | Status |
|---|---|---|---|---|---|---|---|
| TD-001 | | Code / Architecture / Infrastructure / Test / Documentation / Dependency | Deliberate / Accidental | | | Impact x (6 - Effort) | |
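The Priority Score formula rewards high impact and low effort, giving values from 1 (impact 1, effort 5) to 25 (impact 5, effort 1). A short sketch, with hypothetical debt items:

```python
def priority_score(impact, effort):
    """Impact x (6 - Effort), both rated 1-5."""
    if not (1 <= impact <= 5 and 1 <= effort <= 5):
        raise ValueError("impact and effort must be between 1 and 5")
    return impact * (6 - effort)

# Hypothetical inventory entries: (ID, debt item, impact, effort).
inventory = [
    ("TD-001", "No integration tests on checkout", 4, 2),
    ("TD-002", "Monolith shared database", 5, 5),
    ("TD-003", "Stale runbooks", 2, 1),
]

for item in sorted(inventory, key=lambda d: priority_score(d[2], d[3]), reverse=True):
    print(item[0], item[1], priority_score(item[2], item[3]))
```

Note how the high-impact but high-effort item ranks last on this formula; such items usually resurface through the Debt ROI calculation or the remediation roadmap instead.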

Category Definitions:

Category	Examples	Typical Remediation
Code Debt	Duplicated code, high complexity, dead code, inconsistent patterns	Refactoring sprints, linting enforcement
Architecture Debt	Tight coupling, monolith bottleneck, wrong technology choice	Modularization, strangler fig migration
Infrastructure Debt	Manual provisioning, single region, legacy OS, no IaC	IaC migration, platform modernization
Test Debt	Low coverage, flaky tests, no integration tests, slow test suite	Test pyramid investment, test infrastructure
Documentation Debt	Missing architecture docs, stale runbooks, no API docs	Documentation sprints, doc-as-code
Dependency Debt	Outdated frameworks, EOL libraries, unpatched vulnerabilities	Dependency update program, migration plan

Technical Debt Quantification:

Remediation Cost = Effort (person-days) x Daily Rate
Annual Carrying Cost = Weekly productivity impact (hours) x Hourly Rate x 52
Debt ROI = Annual Carrying Cost / Remediation Cost

Prioritize: Debt ROI > 2.0 = remediate immediately
            Debt ROI 1.0-2.0 = plan for next quarter
            Debt ROI < 1.0 = accept or defer
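A worked example of the quantification above, as a minimal sketch. The rates and effort figures are hypothetical, and treating a Debt ROI of exactly 2.0 as "plan for next quarter" follows the thresholds as written.

```python
def debt_roi(effort_person_days, daily_rate, weekly_hours_lost, hourly_rate):
    """Annual carrying cost divided by one-time remediation cost."""
    remediation_cost = effort_person_days * daily_rate
    annual_carrying_cost = weekly_hours_lost * hourly_rate * 52
    return annual_carrying_cost / remediation_cost

def debt_action(roi):
    """Map a Debt ROI value to the prioritization bands."""
    if roi > 2.0:
        return "remediate immediately"
    if roi >= 1.0:
        return "plan for next quarter"
    return "accept or defer"

# Hypothetical item: 20 person-days to fix at 800/day,
# currently costing the team 10 hours per week at 100/hour.
roi = debt_roi(effort_person_days=20, daily_rate=800, weekly_hours_lost=10, hourly_rate=100)
print(f"Debt ROI = {roi:.2f} -> {debt_action(roi)}")
```

Here remediation costs 16,000 and the debt carries 52,000 per year, so the fix pays for itself in under four months.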

4. API Design and Integration Architecture Review

API Quality Assessment:

Dimension	Assessment Criteria	Score (1-5)
Design consistency	Naming conventions, URL patterns, error handling, pagination
Documentation	OpenAPI/Swagger spec, examples, changelog, developer portal
Versioning	Strategy (URL, header, content type), backward compatibility
Authentication	OAuth2, API keys, JWT, rate limiting, scopes
Error handling	Consistent error format, meaningful codes, actionable messages
Performance	Pagination, field selection, compression, caching headers
Testing	Contract tests, integration tests, consumer-driven contracts

Integration Architecture Patterns:

Pattern	Current Use	Appropriateness	Recommendation
Point-to-point REST
Event-driven (pub/sub)
API Gateway
Service mesh
Message queue
GraphQL federation
Batch/ETL
Webhooks

5. Cloud Architecture Assessment

Well-Architected Framework Alignment:

Pillar	AWS/Azure/GCP Best Practices	Current State	Gap	Recommendation
Operational Excellence	IaC, observability, incident management, continuous improvement
Security	IAM, encryption, network controls, compliance, detection
Reliability	Multi-AZ, auto-scaling, backup, DR, fault isolation
Performance Efficiency	Right-sizing, caching, CDN, database optimization
Cost Optimization	Reserved instances, right-sizing, tagging, waste elimination
Sustainability	Efficient resource usage, managed services, right-sizing

Cloud Anti-Pattern Detection:

Anti-Pattern	Description	Impact	Detection
Lift-and-shift without optimization	VMs in cloud without cloud-native redesign	Over-provisioned, expensive, fragile	Cost analysis, resource utilization
Single region deployment	All resources in one availability zone/region	DR risk, latency for distant users	Infrastructure inventory
Oversized instances	Resources provisioned for peak, never scaled down	30-60% cost waste	Utilization monitoring
Hardcoded configuration	Secrets, endpoints, config in code	Security risk, deployment rigidity	Code scanning
No tagging strategy	Resources untagged or inconsistently tagged	Cost allocation impossible, governance gaps	Tag audit
Orphaned resources	Unused disks, IPs, snapshots, load balancers	Cost waste	Resource audit

6. Modernization Pathway Options

Modernization Strategy Selection:

Strategy	Description	Risk	Timeline	Cost	Best For
Strangler Fig	Incrementally replace components, routing traffic to new system	Low	12-36 months	Moderate-High	Large monoliths with identifiable seams
Big Bang Rewrite	Complete rebuild and cutover	Very High	6-18 months	High	Small systems, unsalvageable architecture
Incremental Refactoring	Improve existing system without replacement	Low	Ongoing	Low-Moderate	Fundamentally sound architecture with debt
Platform Migration	Move to new platform (e.g., cloud) preserving logic	Medium	3-12 months	Moderate	Good architecture on wrong infrastructure
Encapsulate and Extend	Wrap legacy with APIs, build new features alongside	Low-Medium	3-6 months initial	Low-Moderate	Legacy system with stable core, new feature needs

Modernization Decision Framework:

Factor	Weight	Modernization Urgency
Business growth blocked by technology	25%
Security/compliance risk from legacy	20%
Maintenance cost escalating	20%
Talent unable to work with current stack	15%
Competitive disadvantage from tech limitations	10%
End-of-life dependencies	10%
Weighted Urgency Score	100%	/5.0

Urgency Interpretation:

Score < 2.0: Maintain and incrementally improve
Score 2.0-2.9: Plan modernization, start with highest-impact areas
Score 3.0-4.0: Prioritize modernization; significant business risk from delay
Score > 4.0: Urgent modernization required; allocate dedicated budget and team
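The weighted score can be computed as below. This sketch assumes each factor is rated 1-5 in the Modernization Urgency column (suggested by the /5.0 total but not stated outright); the factor keys and example ratings are hypothetical, and a score that lands exactly on a band boundary is assigned to the higher band.

```python
# Weights from the decision framework table; keys are shorthand labels.
WEIGHTS = {
    "growth_blocked": 0.25,
    "security_compliance_risk": 0.20,
    "maintenance_cost": 0.20,
    "talent_gap": 0.15,
    "competitive_disadvantage": 0.10,
    "eol_dependencies": 0.10,
}

def urgency_score(ratings):
    """Weighted sum of 1-5 factor ratings, rounded to one decimal."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9, "weights must total 100%"
    return round(sum(WEIGHTS[f] * ratings[f] for f in WEIGHTS), 1)

def interpretation(score):
    """Band lookup; boundary scores go to the higher band (assumption)."""
    if score < 2.0:
        return "Maintain and incrementally improve"
    if score < 3.0:
        return "Plan modernization"
    if score <= 4.0:
        return "Prioritize modernization"
    return "Urgent modernization required"

# Hypothetical ratings for one system.
ratings = {
    "growth_blocked": 4,
    "security_compliance_risk": 3,
    "maintenance_cost": 3,
    "talent_gap": 2,
    "competitive_disadvantage": 2,
    "eol_dependencies": 5,
}
score = urgency_score(ratings)
print(score, interpretation(score))
```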

7. Architecture Decision Records (ADR) Framework

ADR Template:

# ADR-[NNN]: [Decision Title]

**Status**: [Proposed / Accepted / Deprecated / Superseded]
**Date**: [YYYY-MM-DD]
**Decision Makers**: [Names and roles]

## Context
[What is the issue that we are seeing that motivates this decision?]

## Decision
[What is the change that we are proposing and/or doing?]

## Consequences
### Positive
- [Benefit 1]

### Negative
- [Trade-off 1]

### Risks
- [Risk with mitigation]

## Alternatives Considered
| Alternative | Pros | Cons | Reason Rejected |
|---|---|---|---|
| [Alt 1] | | | |

ADR Practice Assessment:

Dimension	Score (1-5)	Evidence
ADR adoption (% of significant decisions documented)
ADR discoverability (searchable, linked from code/docs)
ADR currency (reviewed and updated regularly)
Decision quality (alternatives considered, trade-offs explicit)

8. Remediation Roadmap

Prioritized Investment Framework:

Plot all findings on Impact (business value of fixing) vs. Effort (cost to fix):

Quadrant	Impact	Effort	Action
Quick Wins	High	Low	Do first (Weeks 1-4)
Strategic Projects	High	High	Plan and resource (Months 2-6)
Fill-Ins	Low	Low	Do when convenient
Deprioritize	Low	High	Accept or defer indefinitely
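Findings rated on a numeric scale need a cut-off before they can be placed in a quadrant. The sketch below assumes 1-5 ratings with 4 and above counted as High; that threshold, the function names, and the sample findings are all assumptions for illustration.

```python
QUADRANTS = {
    ("High", "Low"): ("Quick Wins", "Do first (Weeks 1-4)"),
    ("High", "High"): ("Strategic Projects", "Plan and resource (Months 2-6)"),
    ("Low", "Low"): ("Fill-Ins", "Do when convenient"),
    ("Low", "High"): ("Deprioritize", "Accept or defer indefinitely"),
}

def to_label(rating, high_from=4):
    """Collapse a 1-5 rating into High/Low; the cut-off is an assumption."""
    return "High" if rating >= high_from else "Low"

def classify(impact, effort):
    """Return (quadrant, action) for 1-5 impact and effort ratings."""
    return QUADRANTS[(to_label(impact), to_label(effort))]

# Hypothetical findings: (description, impact, effort).
findings = [
    ("Enable MFA for admin console", 5, 1),
    ("Split monolith order module", 5, 5),
    ("Rename legacy config keys", 2, 1),
    ("Rewrite reporting in new framework", 2, 5),
]
for description, impact, effort in findings:
    quadrant, action = classify(impact, effort)
    print(f"{description}: {quadrant} -> {action}")
```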

Remediation Phasing:

Phase	Timeline	Focus	Budget Allocation	Expected Outcomes
Stabilize	Months 1-3	Fix critical risks, close security gaps, improve monitoring	30%	Reduce incident frequency by 50%, close Deal Breaker and Significant Risk findings
Strengthen	Months 3-9	Address tech debt, improve test coverage, modernize CI/CD	40%	Improve deployment frequency, reduce mean time to recovery
Scale	Months 9-18	Architecture evolution, platform modernization, performance optimization	30%	Handle projected growth, improve developer productivity

Output Template

# Architecture Review: [System/Platform Name]

**Prepared for**: [Stakeholder] | **Date**: [Date] | **Scope**: [Full / Subsystem / Specific Concern]

## Executive Summary

**Overall Architecture Health**: [X.X] / 5.0 -- Level [N]: [Ad Hoc / Managed / Defined / Quantified / Optimizing]

[3-5 sentence summary: architecture strengths, critical gaps, and recommended
investment priorities. State whether the architecture supports projected business needs.]

### Architecture Scorecard
| Dimension | Score (1-5) | Status | Key Finding |
|---|---|---|---|
| Scalability | X.X | [Severity] | [One-line finding] |
| Reliability | X.X | [Severity] | [One-line finding] |
| Maintainability | X.X | [Severity] | [One-line finding] |
| Security | X.X | [Severity] | [One-line finding] |
| Observability | X.X | [Severity] | [One-line finding] |
| Performance | X.X | [Severity] | [One-line finding] |
| Extensibility | X.X | [Severity] | [One-line finding] |
| **Overall** | **X.X** | | |

### Finding Summary
| Severity | Count | Key Items |
|---|---|---|
| Deal Breaker | X | [Brief descriptions] |
| Significant Risk | X | [Brief descriptions] |
| Manageable | X | [Brief descriptions] |
| Non-Issue | X | -- |

### Estimated Remediation Investment: $[X]M - $[Y]M (18 months)

## 1. System Decomposition
**Total Components**: [X] | **Critical Dependencies**: [X] | **Single Points of Failure**: [X]
[Component inventory and dependency map]

## 2. Scalability Assessment
**Current Headroom**: [X]x before bottleneck
**Scaling Model**: [Vertical / Horizontal / Auto / Mixed]
[Capacity analysis and bottleneck identification]

## 3. Reliability Assessment
**Achieved Uptime**: [99.X%] vs. Target [99.X%]
**RPO**: [X] | **RTO**: [X] | **Last DR Test**: [Date]
[Failure mode analysis, DR assessment]

## 4. Security Architecture
**Defense in Depth Score**: [X.X]/5.0
**Zero Trust Maturity**: [X.X]/5.0
[Threat model findings, compliance gaps]

## 5. Maintainability & Technical Debt
**Tech Debt Ratio**: [X]% | **Estimated Remediation**: $[X]M
[Debt inventory, prioritized remediation]

## 6. Observability
**Observability Score**: [X.X]/5.0
[Gaps in logging, metrics, tracing, alerting]

## 7. Performance
**P95 Response Time**: [X]ms vs. Target [X]ms
[Performance analysis, anti-patterns identified]

## 8. Cloud Architecture
**Well-Architected Alignment**: [X.X]/5.0
[Anti-patterns identified, optimization opportunities]

## 9. Modernization Assessment
**Modernization Urgency**: [X.X]/5.0
**Recommended Strategy**: [Strangler Fig / Big Bang Rewrite / Incremental Refactoring / Platform Migration / Encapsulate and Extend]
[Modernization pathway with rationale]

## Technical Debt Register

| ID | Debt Item | Category | Impact | Effort | Debt ROI | Priority |
|---|---|---|---|---|---|---|
| TD-001 | [Description] | [Category] | [1-5] | [Person-days] | [X.X] | [Immediate/Q+1/Defer] |

## Remediation Roadmap

### Phase 1: Stabilize (Months 1-3) -- Budget: $[X]
| Initiative | Finding | Effort | Cost | Expected Outcome |
|---|---|---|---|---|
| [Initiative] | [Ref #] | [Person-weeks] | $[X] | [Measurable outcome] |

### Phase 2: Strengthen (Months 3-9) -- Budget: $[X]
| Initiative | Finding | Effort | Cost | Expected Outcome |
|---|---|---|---|---|

### Phase 3: Scale (Months 9-18) -- Budget: $[X]
| Initiative | Finding | Effort | Cost | Expected Outcome |
|---|---|---|---|---|

## Architecture Decision Recommendations

### ADR-001: [Decision Title]
**Recommendation**: [Proposed change]
**Rationale**: [Why this decision, what alternatives were considered]
**Trade-offs**: [What we give up]

## Appendix
- Component inventory (full)
- Dependency map (diagram)
- Performance test results
- Security scan results

Quality Checks

All seven "-ilities" assessed with individual scores and evidence.
System decomposition includes component inventory and dependency mapping.
Scalability assessed at 2x, 5x, and 10x current load with bottleneck identification.
Reliability analysis includes SLA/SLO evaluation, failure modes, and DR assessment.
Security review covers defense in depth, zero trust, and threat modeling (STRIDE).
Technical debt classified by category and type, quantified in dollar terms.
Debt ROI calculated to prioritize remediation investments.
API and integration architecture reviewed with pattern recommendations.
Cloud architecture assessed against well-architected framework.
Performance anti-patterns identified with specific remediation guidance.
Modernization pathway recommended with decision framework scoring.
ADR framework provided for ongoing architecture governance.
Remediation roadmap phased (Stabilize / Strengthen / Scale) with budget allocation.
All findings classified as Deal Breaker / Significant Risk / Manageable / Non-Issue.
Technology maturity model (1-5) applied consistently across all dimensions.
Cost estimates provided as ranges with confidence levels.