tech-architecture-review

Technology architecture assessment across the "-ilities" -- scalability, reliability, maintainability, security, observability, performance, and extensibility. USE THIS SKILL when the user asks about architecture reviews, scalability assessments, tech debt quantification, system design evaluation, modernization planning, cloud architecture review, or infrastructure assessment. Includes system decomposition, failure mode analysis, technical debt classification, and remediation roadmap generation with prioritized investments.

By Kaakati. Updated 3/1/2026

Technology Architecture Review

Required Inputs

  • System/Platform Name: What is being reviewed.
  • Review Scope: Full architecture, specific subsystem, or specific concern (scalability, security, etc.).
  • Business Context: Growth plans, performance requirements, regulatory constraints.
  • Available Documentation: Architecture diagrams, runbooks, ADRs, monitoring dashboards.
  • Access Level: Documentation only, read-only system access, or full environment access.
  • Success Criteria: What "good" looks like for this organization (SLAs, growth targets, compliance).

Execution Steps

1. System Decomposition Analysis

Map the entire system before assessing individual qualities.

Component Inventory:

| Component | Type | Owner | Technology | Deployment | Criticality |
|---|---|---|---|---|---|
| | Service / Library / Database / Queue / Cache / CDN / Gateway | Team | Language, framework, version | Cloud/on-prem, containerized/VM | Critical / High / Medium / Low |

Dependency Mapping:

| Source Component | Target Component | Dependency Type | Communication | Failure Impact |
|---|---|---|---|---|
| | | Synchronous / Asynchronous / Data | REST / gRPC / Event / DB / File | Cascading / Degraded / Isolated |

Dependency Risk Classification:

  • Circular dependencies: Components with bidirectional synchronous calls (always Significant Risk)
  • Critical chain length: Longest synchronous call chain (>4 hops = performance and reliability risk)
  • Single points of failure: Components where failure cascades to >50% of system (Deal Breaker if no redundancy)
  • External dependencies: Third-party services without fallback (risk proportional to SLA gap)
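
The first three checks above can be run mechanically once the dependency map exists. The following is a minimal sketch, assuming the map is exported as `(source, target, dependency type)` tuples; the component names, the `EDGES` data, and the rule that a failure reaches every direct or transitive synchronous caller are illustrative assumptions, and the sketch does not account for redundancy.

```python
from collections import defaultdict

# Hypothetical dependency map rows: (source, target, dependency type).
EDGES = [
    ("web", "api", "sync"),
    ("api", "auth", "sync"),
    ("api", "orders", "sync"),
    ("orders", "billing", "sync"),
    ("billing", "orders", "sync"),  # bidirectional synchronous call
    ("orders", "email", "async"),
]


def sync_graph(edges):
    """Adjacency sets for synchronous dependencies only."""
    graph = defaultdict(set)
    for source, target, kind in edges:
        if kind == "sync":
            graph[source].add(target)
    return graph


def bidirectional_pairs(edges):
    """Component pairs that call each other synchronously."""
    graph = sync_graph(edges)
    return sorted(
        (a, b)
        for a in graph
        for b in graph[a]
        if a < b and a in graph.get(b, ())
    )


def longest_sync_chain(edges):
    """Longest synchronous call chain in hops, visiting each component once."""
    graph = sync_graph(edges)

    def hops_from(node, seen):
        return max(
            (1 + hops_from(nxt, seen | {nxt})
             for nxt in graph.get(node, ()) if nxt not in seen),
            default=0,
        )

    return max((hops_from(node, {node}) for node in list(graph)), default=0)


def cascade_share(edges, component):
    """Share of components that fail with `component`: itself plus every
    direct or transitive synchronous caller (assumed cascade rule)."""
    callers = defaultdict(set)
    for source, target, kind in edges:
        if kind == "sync":
            callers[target].add(source)
    impacted, stack = {component}, [component]
    while stack:
        for caller in callers.get(stack.pop(), ()):
            if caller not in impacted:
                impacted.add(caller)
                stack.append(caller)
    everything = {name for source, target, _ in edges for name in (source, target)}
    return len(impacted) / len(everything)


print(bidirectional_pairs(EDGES))   # circular dependency candidates
print(longest_sync_chain(EDGES))    # compare against the 4-hop threshold
spof = [c for c in ("api", "auth", "orders", "billing")
        if cascade_share(EDGES, c) > 0.5]
print(spof)                         # candidates for the >50% cascade check
```

Components flagged by `cascade_share` still need a manual redundancy check before being rated a Deal Breaker.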

System Boundary Diagram Requirements:

  • Data flow direction and volume (requests/sec, GB/day)
  • Authentication and authorization boundaries
  • Network boundaries (VPC, subnet, public/private)
  • Data classification zones (public, internal, confidential, restricted)

2. Architecture Assessment: The "-ilities"

Score each dimension on the 1-5 maturity scale with specific evidence.

2a. Scalability Assessment

Scaling Model Analysis:

| Dimension | Current State | Assessment |
|---|---|---|
| Scaling direction | Vertical only / Horizontal / Auto-scaling | |
| Statelessness | Stateful servers / Session affinity / Fully stateless | |
| Database scaling | Single instance / Read replicas / Sharding / Distributed | |
| Async processing | Synchronous only / Some queues / Event-driven / Full CQRS | |
| Caching strategy | No caching / Application cache / Distributed cache / Multi-layer | |
| CDN/Edge | No CDN / Static assets / Dynamic content / Edge compute | |

Capacity Planning Matrix:

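| Resource | Current Usage | Current Capacity | Headroom | 2x Load | 5x Load | 10x Load | Bottleneck? |
|---|---|---|---|---|---|---|---|
| CPU (compute) | | | | | | | |
| Memory | | | | | | | |
| Database connections | | | | | | | |
| Database IOPS | | | | | | | |
| Network bandwidth | | | | | | | |
| Storage | | | | | | | |
| API rate limits (external) | | | | | | | |
| Message queue throughput | | | | | | | |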
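
One way to fill the Headroom, load-multiple, and Bottleneck columns is sketched below. It assumes usage grows linearly with load, which real systems often violate, so treat the result as a first pass to confirm with load testing. The `Resource` class, the `project` helper, and the figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    usage: float     # current peak usage, in the resource's own unit
    capacity: float  # provisioned limit, in the same unit

    @property
    def headroom(self) -> float:
        """Load multiple the resource can absorb before saturating."""
        return self.capacity / self.usage


def project(resource, multipliers=(2, 5, 10)):
    """Return projected utilization per load multiple and the first
    multiple that exceeds capacity (None if none does)."""
    utilization = {m: resource.usage * m / resource.capacity for m in multipliers}
    first_breach = next((m for m in multipliers if utilization[m] > 1.0), None)
    return utilization, first_breach


# Hypothetical figures for two rows of the matrix.
db_connections = Resource("Database connections", usage=120, capacity=500)
cpu = Resource("CPU (compute)", usage=30, capacity=400)

for resource in (db_connections, cpu):
    utilization, first_breach = project(resource)
    print(resource.name, round(resource.headroom, 1), utilization, first_breach)
```

In this example the database connection pool is the bottleneck at 5x load, while CPU still has room at 10x.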

Scalability Score Rubric:

| Score | Characteristics |
|---|---|
| 1 - Ad Hoc | Single server, vertical scaling only, no capacity planning, manual intervention required |
| 2 - Managed | Some horizontal scaling, basic load balancing, reactive capacity management |
| 3 - Defined | Auto-scaling configured, stateless services, read replicas, capacity monitoring |
| 4 - Quantified | Load-tested regularly, predictive scaling, database sharding, multi-region capable |
| 5 - Optimizing | Auto-scaling with cost optimization, edge computing, elastic everything, chaos-tested |

2b. Reliability Assessment

SLA/SLO Evaluation:

| Service | Current SLO | Achieved (12 months) | Target SLO | Gap | Business Impact of Downtime |
|---|---|---|---|---|---|
| | 99.X% | 99.X% | 99.X% | | $/hour or impact description |
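
The Gap and Business Impact columns are easier to compare when availability percentages are converted to minutes of downtime. A minimal sketch follows, assuming a 365-day window and a flat hourly cost; the function names and the example figures are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(slo_percent, days=365):
    """Downtime budget implied by an availability SLO over a window."""
    return (100 - slo_percent) / 100 * days * MINUTES_PER_DAY


def slo_gap(achieved_percent, target_percent, days=365):
    """Minutes of downtime beyond what the target SLO would allow."""
    excess = (allowed_downtime_minutes(achieved_percent, days)
              - allowed_downtime_minutes(target_percent, days))
    return max(0.0, excess)


# Hypothetical service: 99.5% achieved against a 99.9% target.
excess_minutes = slo_gap(achieved_percent=99.5, target_percent=99.9)
hourly_cost = 10_000  # hypothetical business impact in $/hour
print(round(excess_minutes), round(excess_minutes / 60 * hourly_cost))
```

A 99.9% target allows roughly 526 minutes of downtime per year, so a service achieving 99.5% exceeds its budget by about 35 hours.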

Failure Mode Analysis:

| Component | Failure Mode | Probability | Impact | Detection | Recovery | Risk Score |
|---|---|---|---|---|---|---|
| | Crash / Slowdown / Data corruption / Network partition | H/M/L | H/M/L | Seconds / Minutes / Hours | Auto / Manual / None | P x I |
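
Because Probability and Impact are rated H/M/L, computing P x I needs a numeric mapping. The sketch below assumes L=1, M=2, H=3 (giving scores from 1 to 9); that mapping and the sample rows are assumptions, not part of the table.

```python
LEVEL = {"L": 1, "M": 2, "H": 3}  # assumed numeric mapping for H/M/L


def risk_score(probability, impact):
    """Risk Score = P x I on the assumed 1-3 scale (range 1-9)."""
    return LEVEL[probability] * LEVEL[impact]


# Hypothetical rows: (component, failure mode, probability, impact).
failure_modes = [
    ("orders-db", "Data corruption", "L", "H"),
    ("api-gateway", "Crash", "M", "H"),
    ("report-worker", "Slowdown", "H", "L"),
]

# Highest risk first; ties keep their original order.
ranked = sorted(failure_modes, key=lambda row: risk_score(row[2], row[3]), reverse=True)
for component, mode, p, i in ranked:
    print(component, mode, risk_score(p, i))
```

Detection time and recovery type are not part of the score here; they can be used to break ties between rows with equal scores.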

Disaster Recovery Assessment:

| Metric | Target | Actual | Gap | Finding Severity |
|---|---|---|---|---|
| Recovery Point Objective (RPO) | | | | |
| Recovery Time Objective (RTO) | | | | |
| Last DR test date | Quarterly | | | |
| DR test success rate | 100% | | | |
| Backup verification | Daily | | | |
| Cross-region replication | Active | | | |
| Runbook completeness | 100% of critical services | | | |

Reliability Score Rubric:

| Score | Characteristics |
|---|---|
| 1 - Ad Hoc | No SLOs, no redundancy, manual recovery, no backups tested |
| 2 - Managed | Basic monitoring, some redundancy, backups exist but rarely tested |
| 3 - Defined | SLOs defined, N+1 redundancy, automated failover for some services, regular backups |
| 4 - Quantified | Error budgets tracked, multi-AZ, automated recovery, regular DR testing |
| 5 - Optimizing | Chaos engineering, multi-region active-active, self-healing, zero-downtime deployments |

2c. Maintainability Assessment

| Dimension | Assessment Criteria | Score (1-5) | Evidence |
|---|---|---|---|
| Code clarity | Consistent style, naming conventions, low complexity | | |
| Documentation | Architecture docs, API docs, runbooks, onboarding guides | | |
| Test coverage | Unit, integration, E2E coverage; test quality, not just quantity | | |
| Dependency management | Dependency freshness, update cadence, vulnerability patching | | |
| Deployment process | CI/CD maturity, rollback capability, deployment frequency | | |
| Modularity | Separation of concerns, bounded contexts, loose coupling | | |
| Onboarding time | Time for new developer to make first meaningful contribution | | |

Onboarding Time Benchmark:

| Maturity | Time to First Commit | Time to Independent Contribution |
|---|---|---|
| Level 1 | >4 weeks | >3 months |
| Level 3 | 1-2 weeks | 1 month |
| Level 5 | <1 week | <2 weeks |

2d. Security Architecture Review

Defense in Depth Assessment:

| Layer | Controls | Status | Finding |
|---|---|---|---|
| Perimeter | WAF, DDoS protection, rate limiting, IP allowlisting | | |
| Network | VPC segmentation, security groups, private subnets, VPN/bastion | | |
| Identity | SSO, MFA, OAuth2/OIDC, service mesh identity | | |
| Application | Input validation, output encoding, CSRF, secure headers | | |
| Data | Encryption at rest (AES-256), in transit (TLS 1.2+), key management | | |
| Monitoring | Security event logging, SIEM integration, threat detection | | |
| Response | Incident response plan, forensic capability, breach playbooks | | |

Zero Trust Evaluation:

| Principle | Implementation | Status | Score (1-5) |
|---|---|---|---|
| Never trust, always verify | Mutual TLS, token validation at every service | | |
| Least privilege access | RBAC with minimal permissions, just-in-time access | | |
| Assume breach | Micro-segmentation, blast radius limitation, lateral movement detection | | |
| Explicit verification | Device health, user identity, location, behavior all checked | | |
| Continuous validation | Session re-evaluation, continuous authentication signals | | |

Threat Modeling (STRIDE per component):

| Component | Spoofing | Tampering | Repudiation | Info Disclosure | Denial of Service | Elevation of Privilege |
|---|---|---|---|---|---|---|
| | Risk / Mitigated | Risk / Mitigated | Risk / Mitigated | Risk / Mitigated | Risk / Mitigated | Risk / Mitigated |

2e. Observability Assessment

| Capability | Level 1: Ad Hoc | Level 3: Defined | Level 5: Optimizing | Score |
|---|---|---|---|---|
| Logging | Console output, no aggregation | Centralized logging, structured format | Contextual logging, correlation IDs, anomaly detection | |
| Metrics | Basic infra metrics (CPU, memory) | Application metrics, dashboards, alerts | Business metrics, SLO tracking, predictive alerting | |
| Tracing | No distributed tracing | Tracing for some services | Full distributed tracing, trace-based testing | |
| Alerting | Manual checks or basic uptime | Threshold alerts, on-call rotation | Intelligent alerting, runbook automation, low noise | |
| Dashboards | None or ad hoc | Service-level dashboards | Unified observability platform, business and tech dashboards | |

2f. Performance Assessment

Performance Analysis Methodology:

| Metric | Measurement Method | Current (p50 / p95 / p99) | Target | Status |
|---|---|---|---|---|
| API response time | APM / synthetic monitoring | | | |
| Page load time | RUM / Lighthouse | | | |
| Database query time | Query profiler / slow query log | | | |
| Background job duration | Job monitoring | | | |
| Throughput (req/sec) | Load balancer metrics | | | |
| Error rate | Application metrics | | | |
| Resource utilization | Infrastructure monitoring | | | |
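
When the monitoring tool only exports raw samples, the p50 / p95 / p99 cells can be derived directly. The sketch below uses the nearest-rank percentile definition and judges status on p95 alone; both choices, along with the `latency_row` helper and the sample data, are assumptions for illustration. APM tools may interpolate and report slightly different values.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(samples)
    rank = max(1, -(-p * len(ordered) // 100))  # ceiling division
    return ordered[rank - 1]


def latency_row(samples_ms, target_p95_ms):
    """Fill the Current and Status cells for one metric."""
    current = {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}
    status = "Meets target" if current["p95"] <= target_p95_ms else "Misses target"
    return current, status


# Hypothetical API response times in milliseconds: 1, 2, ..., 100.
samples = list(range(1, 101))
print(latency_row(samples, target_p95_ms=90))
```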

Performance Anti-Pattern Detection:

| Anti-Pattern | Detection Method | Impact | Remediation |
|---|---|---|---|
| N+1 queries | Query profiling, ORM analysis | Database overload, slow responses | Eager loading, query optimization |
| Synchronous external calls | Trace analysis, call graphs | Cascading latency, timeouts | Async patterns, circuit breakers |
| Missing indexes | Query explain plans, slow query log | Full table scans, high CPU | Index analysis and creation |
| Unbounded queries | Code review, query analysis | Memory exhaustion, timeouts | Pagination, query limits |
| Large payload transfers | Network analysis, API inspection | Bandwidth waste, slow responses | Compression, pagination, field selection |
| Missing caching | Cache hit ratio analysis | Redundant computation/queries | Cache strategy implementation |
| Connection pool exhaustion | Connection monitoring | Service unavailability | Pool sizing, connection management |

2g. Extensibility Assessment

| Dimension | Assessment Criteria | Score (1-5) | Evidence |
|---|---|---|---|
| API design | Versioning strategy, backward compatibility, documentation | | |
| Plugin/extension architecture | Ability to add functionality without core changes | | |
| Configuration management | Feature flags, environment-based config, runtime changes | | |
| Event system | Ability to react to system events without coupling | | |
| Multi-tenancy | Support for tenant isolation, customization per tenant | | |
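
Once all seven dimensions are scored, they need to be rolled up into the overall health figure used in the report. The skill does not prescribe how, so the sketch below assumes an unweighted mean rounded to one decimal and reports the highest maturity level fully reached; a weighted mean or a "weakest dimension" rule would be equally defensible. The scores shown are hypothetical.

```python
LEVELS = {1: "Ad Hoc", 2: "Managed", 3: "Defined", 4: "Quantified", 5: "Optimizing"}

# Hypothetical dimension scores on the 1-5 maturity scale.
scores = {
    "Scalability": 3, "Reliability": 2, "Maintainability": 3, "Security": 2,
    "Observability": 2, "Performance": 3, "Extensibility": 4,
}


def overall_health(dimension_scores):
    """Unweighted mean rounded to one decimal, plus the level fully reached."""
    mean = round(sum(dimension_scores.values()) / len(dimension_scores), 1)
    return mean, LEVELS[int(mean)]  # int() truncates, e.g. 2.7 maps to level 2


print(overall_health(scores))
```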

3. Technical Debt Classification and Quantification

Technical Debt Inventory:

| ID | Debt Item | Category | Type | Impact (1-5) | Effort (1-5) | Priority Score | Status |
|---|---|---|---|---|---|---|---|
| TD-001 | | Code / Architecture / Infrastructure / Test / Documentation / Dependency | Deliberate / Accidental | | | Impact x (6 - Effort) | |
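
The Priority Score formula rewards high impact and low effort, ranging from 1 (impact 1, effort 5) to 25 (impact 5, effort 1). A minimal sketch of scoring and ranking an inventory follows; the debt items are hypothetical.

```python
def priority_score(impact, effort):
    """Priority Score = Impact x (6 - Effort), both on a 1-5 scale."""
    if not (1 <= impact <= 5 and 1 <= effort <= 5):
        raise ValueError("impact and effort must be between 1 and 5")
    return impact * (6 - effort)


# Hypothetical inventory rows: (id, debt item, impact, effort).
inventory = [
    ("TD-001", "No integration tests for checkout", 5, 2),
    ("TD-002", "Stale runbooks", 3, 1),
    ("TD-003", "Monolith blocks independent deploys", 4, 5),
]

ranked = sorted(inventory, key=lambda row: priority_score(row[2], row[3]), reverse=True)
for item_id, item, impact, effort in ranked:
    print(item_id, priority_score(impact, effort), item)
```

Note that a high-impact, high-effort item such as TD-003 scores low here, so the score is better suited to ordering near-term work than to deciding on strategic projects.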

Category Definitions:

| Category | Examples | Typical Remediation |
|---|---|---|
| Code Debt | Duplicated code, high complexity, dead code, inconsistent patterns | Refactoring sprints, linting enforcement |
| Architecture Debt | Tight coupling, monolith bottleneck, wrong technology choice | Modularization, strangler fig migration |
| Infrastructure Debt | Manual provisioning, single region, legacy OS, no IaC | IaC migration, platform modernization |
| Test Debt | Low coverage, flaky tests, no integration tests, slow test suite | Test pyramid investment, test infrastructure |
| Documentation Debt | Missing architecture docs, stale runbooks, no API docs | Documentation sprints, doc-as-code |
| Dependency Debt | Outdated frameworks, EOL libraries, unpatched vulnerabilities | Dependency update program, migration plan |

Technical Debt Quantification:

Remediation Cost = Effort (person-days) x Daily Rate
Annual Carrying Cost = Weekly productivity impact (hours) x Hourly Rate x 52
Debt ROI = Annual Carrying Cost / Remediation Cost

Prioritize: Debt ROI > 2.0 = remediate immediately
            Debt ROI 1.0-2.0 = plan for next quarter
            Debt ROI < 1.0 = accept or defer
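
As a worked example of the quantification above, the sketch below computes the three figures and applies the prioritization thresholds. The function names and the rates and effort figures are hypothetical; an ROI of exactly 2.0 is treated as "plan for next quarter", following the ranges as written.

```python
def debt_roi(effort_person_days, daily_rate, weekly_hours_lost, hourly_rate):
    """Return (remediation cost, annual carrying cost, Debt ROI)."""
    remediation_cost = effort_person_days * daily_rate
    carrying_cost = weekly_hours_lost * hourly_rate * 52
    return remediation_cost, carrying_cost, carrying_cost / remediation_cost


def recommendation(roi):
    """Map a Debt ROI to the prioritization bands."""
    if roi > 2.0:
        return "remediate immediately"
    if roi >= 1.0:
        return "plan for next quarter"
    return "accept or defer"


# Hypothetical item: 20 person-days at $800/day, costing 10 hours/week at $100/hour.
remediation, carrying, roi = debt_roi(20, 800, 10, 100)
print(remediation, carrying, roi, recommendation(roi))
# 16000 52000 3.25 remediate immediately
```

A Debt ROI of 3.25 means the item costs more than three times as much to carry for a year as it does to fix.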

4. API Design and Integration Architecture Review

API Quality Assessment:

| Dimension | Assessment Criteria | Score (1-5) |
|---|---|---|
| Design consistency | Naming conventions, URL patterns, error handling, pagination | |
| Documentation | OpenAPI/Swagger spec, examples, changelog, developer portal | |
| Versioning | Strategy (URL, header, content type), backward compatibility | |
| Authentication | OAuth2, API keys, JWT, rate limiting, scopes | |
| Error handling | Consistent error format, meaningful codes, actionable messages | |
| Performance | Pagination, field selection, compression, caching headers | |
| Testing | Contract tests, integration tests, consumer-driven contracts | |

Integration Architecture Patterns:

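| Pattern | Current Use | Appropriateness | Recommendation |
|---|---|---|---|
| Point-to-point REST | | | |
| Event-driven (pub/sub) | | | |
| API Gateway | | | |
| Service mesh | | | |
| Message queue | | | |
| GraphQL federation | | | |
| Batch/ETL | | | |
| Webhooks | | | |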

5. Cloud Architecture Assessment

Well-Architected Framework Alignment:

| Pillar | AWS/Azure/GCP Best Practices | Current State | Gap | Recommendation |
|---|---|---|---|---|
| Operational Excellence | IaC, observability, incident management, continuous improvement | | | |
| Security | IAM, encryption, network controls, compliance, detection | | | |
| Reliability | Multi-AZ, auto-scaling, backup, DR, fault isolation | | | |
| Performance Efficiency | Right-sizing, caching, CDN, database optimization | | | |
| Cost Optimization | Reserved instances, right-sizing, tagging, waste elimination | | | |
| Sustainability | Efficient resource usage, managed services, right-sizing | | | |

Cloud Anti-Pattern Detection:

| Anti-Pattern | Description | Impact | Detection |
|---|---|---|---|
| Lift-and-shift without optimization | VMs in cloud without cloud-native redesign | Over-provisioned, expensive, fragile | Cost analysis, resource utilization |
| Single region deployment | All resources in one availability zone/region | DR risk, latency for distant users | Infrastructure inventory |
| Oversized instances | Resources provisioned for peak, never scaled down | 30-60% cost waste | Utilization monitoring |
| Hardcoded configuration | Secrets, endpoints, config in code | Security risk, deployment rigidity | Code scanning |
| No tagging strategy | Resources untagged or inconsistently tagged | Cost allocation impossible, governance gaps | Tag audit |
| Orphaned resources | Unused disks, IPs, snapshots, load balancers | Cost waste | Resource audit |

6. Modernization Pathway Options

Modernization Strategy Selection:

| Strategy | Description | Risk | Timeline | Cost | Best For |
|---|---|---|---|---|---|
| Strangler Fig | Incrementally replace components, routing traffic to new system | Low | 12-36 months | Moderate-High | Large monoliths with identifiable seams |
| Big Bang Rewrite | Complete rebuild and cutover | Very High | 6-18 months | High | Small systems, unsalvageable architecture |
| Incremental Refactoring | Improve existing system without replacement | Low | Ongoing | Low-Moderate | Fundamentally sound architecture with debt |
| Platform Migration | Move to new platform (e.g., cloud) preserving logic | Medium | 3-12 months | Moderate | Good architecture on wrong infrastructure |
| Encapsulate and Extend | Wrap legacy with APIs, build new features alongside | Low-Medium | 3-6 months initial | Low-Moderate | Legacy system with stable core, new feature needs |

Modernization Decision Framework:

| Factor | Weight | Current Architecture Score (1-5) | Modernization Urgency |
|---|---|---|---|
| Business growth blocked by technology | 25% | | |
| Security/compliance risk from legacy | 20% | | |
| Maintenance cost escalating | 20% | | |
| Talent unable to work with current stack | 15% | | |
| Competitive disadvantage from tech limitations | 10% | | |
| End-of-life dependencies | 10% | | |
| Weighted Urgency Score | 100% | | /5.0 |

Urgency Interpretation:

  • Score < 2.0: Maintain and incrementally improve
  • Score 2.0 to < 3.0: Plan modernization, start with highest-impact areas
  • Score 3.0-4.0: Prioritize modernization; significant business risk from delay
  • Score > 4.0: Urgent modernization required; allocate dedicated budget and team
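
The weighted score and its interpretation can be computed as sketched below. The weights come from the decision framework; the factor scores are hypothetical, and a score of exactly 3.0 is assigned to the "prioritize" band, which is a choice made here for illustration.

```python
# Weights in percent, taken from the decision framework.
WEIGHTS = {
    "Business growth blocked by technology": 25,
    "Security/compliance risk from legacy": 20,
    "Maintenance cost escalating": 20,
    "Talent unable to work with current stack": 15,
    "Competitive disadvantage from tech limitations": 10,
    "End-of-life dependencies": 10,
}


def weighted_urgency(scores):
    """Weighted Urgency Score on the 1-5 scale."""
    return sum(WEIGHTS[factor] * scores[factor] for factor in WEIGHTS) / 100


def interpret(score):
    """Map a Weighted Urgency Score to its recommended posture."""
    if score < 2.0:
        return "Maintain and incrementally improve"
    if score < 3.0:
        return "Plan modernization"
    if score <= 4.0:
        return "Prioritize modernization"
    return "Urgent modernization required"


# Hypothetical factor scores, in the same order as WEIGHTS.
example = dict(zip(WEIGHTS, (4, 3, 4, 2, 3, 5)))
score = weighted_urgency(example)
print(score, interpret(score))
# 3.5 Prioritize modernization
```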

7. Architecture Decision Records (ADR) Framework

ADR Template:

# ADR-[NNN]: [Decision Title]

**Status**: [Proposed / Accepted / Deprecated / Superseded]
**Date**: [YYYY-MM-DD]
**Decision Makers**: [Names and roles]

## Context
[What is the issue that we are seeing that motivates this decision?]

## Decision
[What is the change that we are proposing and/or doing?]

## Consequences
### Positive
- [Benefit 1]

### Negative
- [Trade-off 1]

### Risks
- [Risk with mitigation]

## Alternatives Considered
| Alternative | Pros | Cons | Reason Rejected |
|---|---|---|---|
| [Alt 1] | | | |

ADR Practice Assessment:

| Dimension | Score (1-5) | Evidence |
|---|---|---|
| ADR adoption (% of significant decisions documented) | | |
| ADR discoverability (searchable, linked from code/docs) | | |
| ADR currency (reviewed and updated regularly) | | |
| Decision quality (alternatives considered, trade-offs explicit) | | |

8. Remediation Roadmap

Prioritized Investment Framework:

Plot all findings on Impact (business value of fixing) vs. Effort (cost to fix):

| Quadrant | Impact | Effort | Action |
|---|---|---|---|
| Quick Wins | High | Low | Do first (Weeks 1-4) |
| Strategic Projects | High | High | Plan and resource (Months 2-6) |
| Fill-Ins | Low | Low | Do when convenient |
| Deprioritize | Low | High | Accept or defer indefinitely |
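
If findings carry 1-5 impact and effort scores, quadrant assignment can be automated. The sketch below assumes a score of 4 or above counts as High; that cut-off and the `quadrant` helper are assumptions, and the threshold should be tuned to how the organization scores.

```python
def quadrant(impact, effort, high=4):
    """Classify a finding; scores at or above `high` count as High (assumed cut-off)."""
    high_impact, high_effort = impact >= high, effort >= high
    if high_impact:
        return "Strategic Projects" if high_effort else "Quick Wins"
    return "Deprioritize" if high_effort else "Fill-Ins"


# Hypothetical findings: (description, impact, effort).
findings = [
    ("Enable MFA for admin accounts", 5, 1),
    ("Split monolith billing module", 5, 5),
    ("Rename legacy config keys", 2, 1),
    ("Rewrite reporting engine", 2, 5),
]
for description, impact, effort in findings:
    print(quadrant(impact, effort), "-", description)
```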

Remediation Phasing:

| Phase | Timeline | Focus | Budget Allocation | Expected Outcomes |
|---|---|---|---|---|
| Stabilize | Months 1-3 | Fix critical risks, close security gaps, improve monitoring | 30% | Reduce incident frequency by 50%, close Deal Breaker and Significant Risk findings |
| Strengthen | Months 3-9 | Address tech debt, improve test coverage, modernize CI/CD | 40% | Improve deployment frequency, reduce mean time to recovery |
| Scale | Months 9-18 | Architecture evolution, platform modernization, performance optimization | 30% | Handle projected growth, improve developer productivity |

Output Template

# Architecture Review: [System/Platform Name]

**Prepared for**: [Stakeholder] | **Date**: [Date] | **Scope**: [Full / Subsystem / Specific Concern]

## Executive Summary

**Overall Architecture Health**: [X.X] / 5.0 -- Level [N]: [Ad Hoc / Managed / Defined / Quantified / Optimizing]

[3-5 sentence summary: architecture strengths, critical gaps, and recommended
investment priorities. State whether the architecture supports projected business needs.]

### Architecture Scorecard
| Dimension | Score (1-5) | Status | Key Finding |
|---|---|---|---|
| Scalability | X.X | [Severity] | [One-line finding] |
| Reliability | X.X | [Severity] | [One-line finding] |
| Maintainability | X.X | [Severity] | [One-line finding] |
| Security | X.X | [Severity] | [One-line finding] |
| Observability | X.X | [Severity] | [One-line finding] |
| Performance | X.X | [Severity] | [One-line finding] |
| Extensibility | X.X | [Severity] | [One-line finding] |
| **Overall** | **X.X** | | |

### Finding Summary
| Severity | Count | Key Items |
|---|---|---|
| Deal Breaker | X | [Brief descriptions] |
| Significant Risk | X | [Brief descriptions] |
| Manageable | X | [Brief descriptions] |
| Non-Issue | X | -- |

### Estimated Remediation Investment: $[X]M - $[Y]M (18 months)

## 1. System Decomposition
**Total Components**: [X] | **Critical Dependencies**: [X] | **Single Points of Failure**: [X]
[Component inventory and dependency map]

## 2. Scalability Assessment
**Current Headroom**: [X]x before bottleneck
**Scaling Model**: [Vertical / Horizontal / Auto / Mixed]
[Capacity analysis and bottleneck identification]

## 3. Reliability Assessment
**Achieved Uptime**: [99.X%] vs. Target [99.X%]
**RPO**: [X] | **RTO**: [X] | **Last DR Test**: [Date]
[Failure mode analysis, DR assessment]

## 4. Security Architecture
**Defense in Depth Score**: [X.X]/5.0
**Zero Trust Maturity**: [X.X]/5.0
[Threat model findings, compliance gaps]

## 5. Maintainability & Technical Debt
**Tech Debt Ratio**: [X]% | **Estimated Remediation**: $[X]M
[Debt inventory, prioritized remediation]

## 6. Observability
**Observability Score**: [X.X]/5.0
[Gaps in logging, metrics, tracing, alerting]

## 7. Performance
**P95 Response Time**: [X]ms vs. Target [X]ms
[Performance analysis, anti-patterns identified]

## 8. Cloud Architecture
**Well-Architected Alignment**: [X.X]/5.0
[Anti-patterns identified, optimization opportunities]

## 9. Modernization Assessment
**Modernization Urgency**: [X.X]/5.0
**Recommended Strategy**: [Strangler Fig / Incremental / Platform Migration / Encapsulate]
[Modernization pathway with rationale]

## Technical Debt Register

| ID | Debt Item | Category | Impact | Effort | Debt ROI | Priority |
|---|---|---|---|---|---|---|
| TD-001 | [Description] | [Category] | [1-5] | [Person-days] | [X.X] | [Immediate/Q+1/Defer] |

## Remediation Roadmap

### Phase 1: Stabilize (Months 1-3) -- Budget: $[X]
| Initiative | Finding | Effort | Cost | Expected Outcome |
|---|---|---|---|---|
| [Initiative] | [Ref #] | [Person-weeks] | $[X] | [Measurable outcome] |

### Phase 2: Strengthen (Months 3-9) -- Budget: $[X]
| Initiative | Finding | Effort | Cost | Expected Outcome |
|---|---|---|---|---|

### Phase 3: Scale (Months 9-18) -- Budget: $[X]
| Initiative | Finding | Effort | Cost | Expected Outcome |
|---|---|---|---|---|

## Architecture Decision Recommendations

### ADR-001: [Decision Title]
**Recommendation**: [Proposed change]
**Rationale**: [Why this decision, what alternatives were considered]
**Trade-offs**: [What we give up]

## Appendix
- Component inventory (full)
- Dependency map (diagram)
- Performance test results
- Security scan results

Quality Checks

  • All seven "-ilities" assessed with individual scores and evidence.
  • System decomposition includes component inventory and dependency mapping.
  • Scalability assessed at 2x, 5x, and 10x current load with bottleneck identification.
  • Reliability analysis includes SLA/SLO evaluation, failure modes, and DR assessment.
  • Security review covers defense in depth, zero trust, and threat modeling (STRIDE).
  • Technical debt classified by category and type, quantified in dollar terms.
  • Debt ROI calculated to prioritize remediation investments.
  • API and integration architecture reviewed with pattern recommendations.
  • Cloud architecture assessed against well-architected framework.
  • Performance anti-patterns identified with specific remediation guidance.
  • Modernization pathway recommended with decision framework scoring.
  • ADR framework provided for ongoing architecture governance.
  • Remediation roadmap phased (Stabilize / Strengthen / Scale) with budget allocation.
  • All findings classified as Deal Breaker / Significant Risk / Manageable / Non-Issue.
  • Technology maturity model (1-5) applied consistently across all dimensions.
  • Cost estimates provided as ranges with confidence levels.
Install via CLI
npx skills add https://github.com/Kaakati/managing-director --skill tech-architecture-review