name: tech-architecture-review
description: >
  Technology architecture assessment across the "-ilities" -- scalability, reliability, maintainability, security, observability, performance, and extensibility. USE THIS SKILL when the user asks about architecture reviews, scalability assessments, tech debt quantification, system design evaluation, modernization planning, cloud architecture review, or infrastructure assessment. Includes system decomposition, failure mode analysis, technical debt classification, and remediation roadmap generation with prioritized investments.
Technology Architecture Review
Required Inputs
- System/Platform Name: What is being reviewed.
- Review Scope: Full architecture, specific subsystem, or specific concern (scalability, security, etc.).
- Business Context: Growth plans, performance requirements, regulatory constraints.
- Available Documentation: Architecture diagrams, runbooks, ADRs, monitoring dashboards.
- Access Level: Documentation only, read-only system access, or full environment access.
- Success Criteria: What "good" looks like for this organization (SLAs, growth targets, compliance).
Execution Steps
1. System Decomposition Analysis
Map the entire system before assessing individual qualities.
Component Inventory:
| Component | Type | Owner | Technology | Deployment | Criticality |
|---|---|---|---|---|---|
| | Service / Library / Database / Queue / Cache / CDN / Gateway | Team | Language, framework, version | Cloud/on-prem, containerized/VM | Critical / High / Medium / Low |
Dependency Mapping:
| Source Component | Target Component | Dependency Type | Communication | Failure Impact |
|---|---|---|---|---|
| | | Synchronous / Asynchronous / Data | REST / gRPC / Event / DB / File | Cascading / Degraded / Isolated |
Dependency Risk Classification:
- Circular dependencies: Components with bidirectional synchronous calls (always Significant Risk)
- Critical chain length: Longest synchronous call chain (>4 hops = performance and reliability risk)
- Single points of failure: Components where failure cascades to >50% of system (Deal Breaker if no redundancy)
- External dependencies: Third-party services without fallback (risk proportional to SLA gap)
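The first two checks in this classification can be automated once the dependency map exists as data. The following is a minimal sketch, assuming the map is exported as `(source, target, dependency_type)` tuples; the `EDGES` sample, the component names, and both function names are hypothetical.

```python
# Hypothetical edge list exported from the dependency map:
# (source component, target component, dependency type).
EDGES = [
    ("web", "orders", "sync"),
    ("orders", "payments", "sync"),
    ("payments", "orders", "sync"),  # bidirectional synchronous call
    ("orders", "email", "async"),
]


def circular_sync_pairs(edges):
    """Return sorted pairs of components that call each other synchronously."""
    sync = {(s, t) for s, t, kind in edges if kind == "sync"}
    pairs = {tuple(sorted((s, t))) for s, t in sync if s != t and (t, s) in sync}
    return sorted(pairs)


def longest_sync_chain(edges):
    """Length in hops of the longest synchronous call chain.

    A chain never revisits a component, so cycles do not inflate the count.
    """
    graph = {}
    for s, t, kind in edges:
        if kind == "sync":
            graph.setdefault(s, []).append(t)

    def walk(node, seen):
        best = 0
        for nxt in graph.get(node, []):
            if nxt not in seen:
                best = max(best, 1 + walk(nxt, seen | {nxt}))
        return best

    return max((walk(node, {node}) for node in graph), default=0)
```

A chain longer than 4 hops, or any non-empty result from `circular_sync_pairs`, would be recorded as a finding under the classification above. The exhaustive walk is fine for a few hundred components; larger graphs would need a different algorithm.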
System Boundary Diagram Requirements:
- Data flow direction and volume (requests/sec, GB/day)
- Authentication and authorization boundaries
- Network boundaries (VPC, subnet, public/private)
- Data classification zones (public, internal, confidential, restricted)
2. Architecture Assessment: The "-ilities"
Score each dimension on the 1-5 maturity scale with specific evidence.
2a. Scalability Assessment
Scaling Model Analysis:
| Dimension | Current State | Assessment |
|---|---|---|
| Scaling direction | Vertical only / Horizontal / Auto-scaling | |
| Statelessness | Stateful servers / Session affinity / Fully stateless | |
| Database scaling | Single instance / Read replicas / Sharding / Distributed | |
| Async processing | Synchronous only / Some queues / Event-driven / Full CQRS | |
| Caching strategy | No caching / Application cache / Distributed cache / Multi-layer | |
| CDN/Edge | No CDN / Static assets / Dynamic content / Edge compute | |
Capacity Planning Matrix:
| Resource | Current Usage | Current Capacity | Headroom | 2x Load | 5x Load | 10x Load | Bottleneck? |
|---|---|---|---|---|---|---|---|
| CPU (compute) | |||||||
| Memory | |||||||
| Database connections | |||||||
| Database IOPS | |||||||
| Network bandwidth | |||||||
| Storage | |||||||
| API rate limits (external) | |||||||
| Message queue throughput | |||||||
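The Headroom and 2x/5x/10x columns can be filled with simple arithmetic once usage and capacity are measured. This sketch assumes resource usage grows linearly with load, which is a simplification that real systems often violate. The `RESOURCES` figures and the function names are hypothetical.

```python
# Hypothetical measurements: resource -> (current usage, current capacity).
RESOURCES = {
    "cpu_cores": (40, 160),
    "db_connections": (300, 500),
    "network_gbps": (2, 40),
}


def headroom(current_usage, capacity):
    """Multiple of current load a resource can absorb before saturating."""
    return capacity / current_usage


def bottlenecks(resources, multiplier):
    """Resources whose projected usage exceeds capacity at the given load multiple.

    Assumes usage scales linearly with load.
    """
    return [
        name
        for name, (usage, capacity) in resources.items()
        if usage * multiplier > capacity
    ]
```

Calling `bottlenecks(RESOURCES, m)` for `m` in 2, 5, and 10 marks the Bottleneck? column for each scenario. The resource with the smallest `headroom` value is the first one expected to saturate.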
Scalability Score Rubric:
| Score | Characteristics |
|---|---|
| 1 - Ad Hoc | Single server, vertical scaling only, no capacity planning, manual intervention required |
| 2 - Managed | Some horizontal scaling, basic load balancing, reactive capacity management |
| 3 - Defined | Auto-scaling configured, stateless services, read replicas, capacity monitoring |
| 4 - Quantified | Load-tested regularly, predictive scaling, database sharding, multi-region capable |
| 5 - Optimizing | Auto-scaling with cost optimization, edge computing, elastic everything, chaos-tested |
2b. Reliability Assessment
SLA/SLO Evaluation:
| Service | Current SLO | Achieved (12 months) | Target SLO | Gap | Business Impact of Downtime |
|---|---|---|---|---|---|
| | 99.X% | 99.X% | 99.X% | | $/hour or impact description |
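To make the Gap column concrete, an availability percentage can be converted into allowed downtime and compared with what was achieved. This is a rough sketch; the function names are hypothetical, and it treats availability as purely time-based, which may not match how a given SLO is defined.

```python
def allowed_downtime_minutes(slo_percent, period_days=365):
    """Downtime, in minutes, that an availability SLO permits over the period."""
    return (1 - slo_percent / 100) * period_days * 24 * 60


def error_budget_remaining(target_percent, achieved_percent):
    """Fraction of the error budget left; negative when the SLO was missed."""
    budget = 100 - target_percent
    spent = 100 - achieved_percent
    return 1 - spent / budget
```

For example, a 99.9% target over a year allows roughly 526 minutes of downtime. A service that achieved 99.95% against that target has used about half of its budget.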
Failure Mode Analysis:
| Component | Failure Mode | Probability | Impact | Detection | Recovery | Risk Score |
|---|---|---|---|---|---|---|
| | Crash / Slowdown / Data corruption / Network partition | H/M/L | H/M/L | Seconds / Minutes / Hours | Auto / Manual / None | P x I |
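The P x I risk score needs a numeric mapping for the H/M/L ratings, which the table does not specify. The sketch below assumes L=1, M=2, H=3; the mapping, the sample rows, and the function names are all hypothetical.

```python
# Assumed numeric mapping for the H/M/L ratings (not defined by the table).
LEVEL = {"L": 1, "M": 2, "H": 3}


def risk_score(probability, impact):
    """Risk score as probability times impact, each rated H, M, or L."""
    return LEVEL[probability] * LEVEL[impact]


def rank_failure_modes(rows):
    """Sort (component, failure_mode, probability, impact) rows, highest risk first."""
    return sorted(rows, key=lambda row: risk_score(row[2], row[3]), reverse=True)
```

Under this mapping scores range from 1 to 9. Detection time and recovery mode are not part of the score and still need to be weighed separately.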
Disaster Recovery Assessment:
| Metric | Target | Actual | Gap | Finding Severity |
|---|---|---|---|---|
| Recovery Point Objective (RPO) | ||||
| Recovery Time Objective (RTO) | ||||
| Last DR test date | Quarterly | |||
| DR test success rate | 100% | |||
| Backup verification | Daily | |||
| Cross-region replication | Active | |||
| Runbook completeness | 100% of critical services | |||
Reliability Score Rubric:
| Score | Characteristics |
|---|---|
| 1 - Ad Hoc | No SLOs, no redundancy, manual recovery, no backups tested |
| 2 - Managed | Basic monitoring, some redundancy, backups exist but rarely tested |
| 3 - Defined | SLOs defined, N+1 redundancy, automated failover for some services, regular backups |
| 4 - Quantified | Error budgets tracked, multi-AZ, automated recovery, regular DR testing |
| 5 - Optimizing | Chaos engineering, multi-region active-active, self-healing, zero-downtime deployments |
2c. Maintainability Assessment
| Dimension | Assessment Criteria | Score (1-5) | Evidence |
|---|---|---|---|
| Code clarity | Consistent style, naming conventions, low complexity | ||
| Documentation | Architecture docs, API docs, runbooks, onboarding guides | ||
| Test coverage | Unit, integration, E2E coverage; test quality, not just quantity | ||
| Dependency management | Dependency freshness, update cadence, vulnerability patching | ||
| Deployment process | CI/CD maturity, rollback capability, deployment frequency | ||
| Modularity | Separation of concerns, bounded contexts, loose coupling | ||
| Onboarding time | Time for new developer to make first meaningful contribution | ||
Onboarding Time Benchmark:
| Maturity | Time to First Commit | Time to Independent Contribution |
|---|---|---|
| Level 1 | >4 weeks | >3 months |
| Level 3 | 1-2 weeks | 1 month |
| Level 5 | <1 week | <2 weeks |
2d. Security Architecture Review
Defense in Depth Assessment:
| Layer | Controls | Status | Finding |
|---|---|---|---|
| Perimeter | WAF, DDoS protection, rate limiting, IP allowlisting | ||
| Network | VPC segmentation, security groups, private subnets, VPN/bastion | ||
| Identity | SSO, MFA, OAuth2/OIDC, service mesh identity | ||
| Application | Input validation, output encoding, CSRF, secure headers | ||
| Data | Encryption at rest (AES-256), in transit (TLS 1.2+), key management | ||
| Monitoring | Security event logging, SIEM integration, threat detection | ||
| Response | Incident response plan, forensic capability, breach playbooks | ||
Zero Trust Evaluation:
| Principle | Implementation Status | Score (1-5) |
|---|---|---|
| Never trust, always verify | Mutual TLS, token validation at every service | |
| Least privilege access | RBAC with minimal permissions, just-in-time access | |
| Assume breach | Micro-segmentation, blast radius limitation, lateral movement detection | |
| Explicit verification | Device health, user identity, location, behavior all checked | |
| Continuous validation | Session re-evaluation, continuous authentication signals | |
Threat Modeling (STRIDE per component):
| Component | Spoofing | Tampering | Repudiation | Info Disclosure | Denial of Service | Elevation of Privilege |
|---|---|---|---|---|---|---|
| | Risk / Mitigated | Risk / Mitigated | Risk / Mitigated | Risk / Mitigated | Risk / Mitigated | Risk / Mitigated |
2e. Observability Assessment
| Capability | Level 1: Ad Hoc | Level 3: Defined | Level 5: Optimizing | Score |
|---|---|---|---|---|
| Logging | Console output, no aggregation | Centralized logging, structured format | Contextual logging, correlation IDs, anomaly detection | |
| Metrics | Basic infra metrics (CPU, memory) | Application metrics, dashboards, alerts | Business metrics, SLO tracking, predictive alerting | |
| Tracing | No distributed tracing | Tracing for some services | Full distributed tracing, trace-based testing | |
| Alerting | Manual checks or basic uptime | Threshold alerts, on-call rotation | Intelligent alerting, runbook automation, low noise | |
| Dashboards | None or ad hoc | Service-level dashboards | Unified observability platform, business and tech dashboards | |
2f. Performance Assessment
Performance Analysis Methodology:
| Metric | Measurement Method | Current (p50 / p95 / p99) | Target | Status |
|---|---|---|---|---|
| API response time | APM / synthetic monitoring | |||
| Page load time | RUM / Lighthouse | |||
| Database query time | Query profiler / slow query log | |||
| Background job duration | Job monitoring | |||
| Throughput (req/sec) | Load balancer metrics | |||
| Error rate | Application metrics | |||
| Resource utilization | Infrastructure monitoring | |||
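When the monitoring tool does not report p50/p95/p99 directly, they can be derived from raw latency samples. This sketch uses the nearest-rank definition, which is one of several common percentile conventions, so its results may differ slightly from what an APM tool shows. The function name is hypothetical.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Multiply before dividing so whole-number ranks stay exact.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

Running it with `p` set to 50, 95, and 99 over the same sample window gives the three values for the Current column.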
Performance Anti-Pattern Detection:
| Anti-Pattern | Detection Method | Impact | Remediation |
|---|---|---|---|
| N+1 queries | Query profiling, ORM analysis | Database overload, slow responses | Eager loading, query optimization |
| Synchronous external calls | Trace analysis, call graphs | Cascading latency, timeouts | Async patterns, circuit breakers |
| Missing indexes | Query explain plans, slow query log | Full table scans, high CPU | Index analysis and creation |
| Unbounded queries | Code review, query analysis | Memory exhaustion, timeouts | Pagination, query limits |
| Large payload transfers | Network analysis, API inspection | Bandwidth waste, slow responses | Compression, pagination, field selection |
| Missing caching | Cache hit ratio analysis | Redundant computation/queries | Cache strategy implementation |
| Connection pool exhaustion | Connection monitoring | Service unavailability | Pool sizing, connection management |
2g. Extensibility Assessment
| Dimension | Assessment Criteria | Score (1-5) | Evidence |
|---|---|---|---|
| API design | Versioning strategy, backward compatibility, documentation | ||
| Plugin/extension architecture | Ability to add functionality without core changes | ||
| Configuration management | Feature flags, environment-based config, runtime changes | ||
| Event system | Ability to react to system events without coupling | ||
| Multi-tenancy | Support for tenant isolation, customization per tenant | ||
3. Technical Debt Classification and Quantification
Technical Debt Inventory:
| ID | Debt Item | Category | Type | Impact (1-5) | Effort (1-5) | Priority Score | Status |
|---|---|---|---|---|---|---|---|
| TD-001 | | Code / Architecture / Infrastructure / Test / Documentation / Dependency | Deliberate / Accidental | | | Impact x (6 - Effort) | |
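The Priority Score formula rewards high impact and low effort. Here is a small sketch of the calculation, with a hypothetical function name and input validation added for illustration.

```python
def priority_score(impact, effort):
    """Priority Score = Impact x (6 - Effort), with both inputs on the 1-5 scale."""
    if not (1 <= impact <= 5 and 1 <= effort <= 5):
        raise ValueError("impact and effort must be between 1 and 5")
    return impact * (6 - effort)
```

Scores range from 1 (impact 1, effort 5) to 25 (impact 5, effort 1), so sorting the inventory by descending score puts high-impact, low-effort items first.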
Category Definitions:
| Category | Examples | Typical Remediation |
|---|---|---|
| Code Debt | Duplicated code, high complexity, dead code, inconsistent patterns | Refactoring sprints, linting enforcement |
| Architecture Debt | Tight coupling, monolith bottleneck, wrong technology choice | Modularization, strangler fig migration |
| Infrastructure Debt | Manual provisioning, single region, legacy OS, no IaC | IaC migration, platform modernization |
| Test Debt | Low coverage, flaky tests, no integration tests, slow test suite | Test pyramid investment, test infrastructure |
| Documentation Debt | Missing architecture docs, stale runbooks, no API docs | Documentation sprints, doc-as-code |
| Dependency Debt | Outdated frameworks, EOL libraries, unpatched vulnerabilities | Dependency update program, migration plan |
Technical Debt Quantification:
Remediation Cost = Effort (person-days) x Daily Rate
Annual Carrying Cost = Weekly productivity impact (hours) x Hourly Rate x 52
Debt ROI = Annual Carrying Cost / Remediation Cost
Prioritize: Debt ROI > 2.0 = remediate immediately
Debt ROI 1.0-2.0 = plan for next quarter
Debt ROI < 1.0 = accept or defer
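The formulas and thresholds above can be worked through in code. The sketch below is illustrative: the function names and the sample rates in the usage note are hypothetical.

```python
def debt_roi(effort_person_days, daily_rate, weekly_hours_lost, hourly_rate):
    """Debt ROI = annual carrying cost divided by one-time remediation cost."""
    remediation_cost = effort_person_days * daily_rate
    annual_carrying_cost = weekly_hours_lost * hourly_rate * 52
    return annual_carrying_cost / remediation_cost


def recommendation(roi):
    """Map a Debt ROI value to the prioritization bands."""
    if roi > 2.0:
        return "remediate immediately"
    if roi >= 1.0:
        return "plan for next quarter"
    return "accept or defer"
```

As a worked example with assumed rates, a 20 person-day fix at $800 per day costs $16,000. If the debt wastes 10 hours per week at $100 per hour, the annual carrying cost is $52,000, so the ROI is 3.25 and the item falls in the "remediate immediately" band.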
4. API Design and Integration Architecture Review
API Quality Assessment:
| Dimension | Assessment Criteria | Score (1-5) |
|---|---|---|
| Design consistency | Naming conventions, URL patterns, error handling, pagination | |
| Documentation | OpenAPI/Swagger spec, examples, changelog, developer portal | |
| Versioning | Strategy (URL, header, content type), backward compatibility | |
| Authentication | OAuth2, API keys, JWT, rate limiting, scopes | |
| Error handling | Consistent error format, meaningful codes, actionable messages | |
| Performance | Pagination, field selection, compression, caching headers | |
| Testing | Contract tests, integration tests, consumer-driven contracts | |
Integration Architecture Patterns:
| Pattern | Current Use | Appropriateness | Recommendation |
|---|---|---|---|
| Point-to-point REST | |||
| Event-driven (pub/sub) | |||
| API Gateway | |||
| Service mesh | |||
| Message queue | |||
| GraphQL federation | |||
| Batch/ETL | |||
| Webhooks | |||
5. Cloud Architecture Assessment
Well-Architected Framework Alignment:
| Pillar | AWS/Azure/GCP Best Practices | Current State | Gap | Recommendation |
|---|---|---|---|---|
| Operational Excellence | IaC, observability, incident management, continuous improvement | |||
| Security | IAM, encryption, network controls, compliance, detection | |||
| Reliability | Multi-AZ, auto-scaling, backup, DR, fault isolation | |||
| Performance Efficiency | Right-sizing, caching, CDN, database optimization | |||
| Cost Optimization | Reserved instances, right-sizing, tagging, waste elimination | |||
| Sustainability | Efficient resource usage, managed services, right-sizing | |||
Cloud Anti-Pattern Detection:
| Anti-Pattern | Description | Impact | Detection |
|---|---|---|---|
| Lift-and-shift without optimization | VMs in cloud without cloud-native redesign | Over-provisioned, expensive, fragile | Cost analysis, resource utilization |
| Single region deployment | All resources in one availability zone/region | DR risk, latency for distant users | Infrastructure inventory |
| Oversized instances | Resources provisioned for peak, never scaled down | 30-60% cost waste | Utilization monitoring |
| Hardcoded configuration | Secrets, endpoints, config in code | Security risk, deployment rigidity | Code scanning |
| No tagging strategy | Resources untagged or inconsistently tagged | Cost allocation impossible, governance gaps | Tag audit |
| Orphaned resources | Unused disks, IPs, snapshots, load balancers | Cost waste | Resource audit |
6. Modernization Pathway Options
Modernization Strategy Selection:
| Strategy | Description | Risk | Timeline | Cost | Best For |
|---|---|---|---|---|---|
| Strangler Fig | Incrementally replace components, routing traffic to new system | Low | 12-36 months | Moderate-High | Large monoliths with identifiable seams |
| Big Bang Rewrite | Complete rebuild and cutover | Very High | 6-18 months | High | Small systems, unsalvageable architecture |
| Incremental Refactoring | Improve existing system without replacement | Low | Ongoing | Low-Moderate | Fundamentally sound architecture with debt |
| Platform Migration | Move to new platform (e.g., cloud) preserving logic | Medium | 3-12 months | Moderate | Good architecture on wrong infrastructure |
| Encapsulate and Extend | Wrap legacy with APIs, build new features alongside | Low-Medium | 3-6 months initial | Low-Moderate | Legacy system with stable core, new feature needs |
Modernization Decision Framework:
| Factor | Weight | Current Architecture Score (1-5) | Modernization Urgency |
|---|---|---|---|
| Business growth blocked by technology | 25% | ||
| Security/compliance risk from legacy | 20% | ||
| Maintenance cost escalating | 20% | ||
| Talent unable to work with current stack | 15% | ||
| Competitive disadvantage from tech limitations | 10% | ||
| End-of-life dependencies | 10% | ||
| Weighted Urgency Score | 100% | | /5.0 |
Urgency Interpretation:
- Score < 2.0: Maintain and incrementally improve
- Score 2.0-2.9: Plan modernization, start with highest-impact areas
- Score 3.0-4.0: Prioritize modernization; significant business risk from delay
- Score > 4.0: Urgent modernization required; allocate dedicated budget and team
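The weighted urgency score is a weighted average of the six factor scores. The sketch below uses the weights from the decision framework. The factor keys and function names are hypothetical, and boundary values are assigned as coded, which is one reading of the bands.

```python
# Weights in percent, taken from the decision framework table.
WEIGHTS = {
    "growth_blocked": 25,
    "security_compliance_risk": 20,
    "maintenance_cost": 20,
    "talent": 15,
    "competitive_disadvantage": 10,
    "eol_dependencies": 10,
}


def weighted_urgency(scores):
    """Weighted average of the 1-5 factor scores, on a 5.0 scale."""
    return sum(WEIGHTS[factor] * scores[factor] for factor in WEIGHTS) / 100


def interpret(score):
    """Map the weighted score to an urgency band."""
    if score > 4.0:
        return "Urgent modernization required"
    if score >= 3.0:
        return "Prioritize modernization"
    if score >= 2.0:
        return "Plan modernization"
    return "Maintain and incrementally improve"
```

Because the weights sum to 100, scoring every factor the same value returns that value, which is a quick sanity check on the inputs.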
7. Architecture Decision Records (ADR) Framework
ADR Template:
# ADR-[NNN]: [Decision Title]
**Status**: [Proposed / Accepted / Deprecated / Superseded]
**Date**: [YYYY-MM-DD]
**Decision Makers**: [Names and roles]
## Context
[What is the issue that we are seeing that motivates this decision?]
## Decision
[What is the change that we are proposing and/or doing?]
## Consequences
### Positive
- [Benefit 1]
### Negative
- [Trade-off 1]
### Risks
- [Risk with mitigation]
## Alternatives Considered
| Alternative | Pros | Cons | Reason Rejected |
|---|---|---|---|
| [Alt 1] | | | |
ADR Practice Assessment:
| Dimension | Score (1-5) | Evidence |
|---|---|---|
| ADR adoption (% of significant decisions documented) | ||
| ADR discoverability (searchable, linked from code/docs) | ||
| ADR currency (reviewed and updated regularly) | ||
| Decision quality (alternatives considered, trade-offs explicit) | ||
8. Remediation Roadmap
Prioritized Investment Framework:
Plot all findings on Impact (business value of fixing) vs. Effort (cost to fix):
| Quadrant | Impact | Effort | Action |
|---|---|---|---|
| Quick Wins | High | Low | Do first (Weeks 1-4) |
| Strategic Projects | High | High | Plan and resource (Months 2-6) |
| Fill-Ins | Low | Low | Do when convenient |
| Deprioritize | Low | High | Accept or defer indefinitely |
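Assigning findings to quadrants requires a cut-off between "High" and "Low", which the framework leaves open. This sketch assumes both axes are scored 1-5 and treats 3 or above as High; the threshold and the function name are hypothetical.

```python
# (high impact?, high effort?) -> quadrant name from the table above.
QUADRANTS = {
    (True, False): "Quick Wins",
    (True, True): "Strategic Projects",
    (False, False): "Fill-Ins",
    (False, True): "Deprioritize",
}


def quadrant(impact, effort, threshold=3):
    """Classify a finding by impact and effort (assumed 1-5 scale, >= threshold is High)."""
    return QUADRANTS[(impact >= threshold, effort >= threshold)]
```

Changing `threshold` shifts the boundary for both axes at once. Findings that land near the cut-off are worth reviewing by hand rather than trusting the label.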
Remediation Phasing:
| Phase | Timeline | Focus | Budget Allocation | Expected Outcomes |
|---|---|---|---|---|
| Stabilize | Months 1-3 | Fix critical risks, close security gaps, improve monitoring | 30% | Reduce incident frequency by 50%, close Deal Breaker and Significant Risk findings |
| Strengthen | Months 3-9 | Address tech debt, improve test coverage, modernize CI/CD | 40% | Improve deployment frequency, reduce mean time to recovery |
| Scale | Months 9-18 | Architecture evolution, platform modernization, performance optimization | 30% | Handle projected growth, improve developer productivity |
Output Template
# Architecture Review: [System/Platform Name]
**Prepared for**: [Stakeholder] | **Date**: [Date] | **Scope**: [Full / Subsystem / Specific Concern]
## Executive Summary
**Overall Architecture Health**: [X.X] / 5.0 -- Level [N]: [Ad Hoc / Managed / Defined / Quantified / Optimizing]
[3-5 sentence summary: architecture strengths, critical gaps, and recommended
investment priorities. State whether the architecture supports projected business needs.]
### Architecture Scorecard
| Dimension | Score (1-5) | Status | Key Finding |
|---|---|---|---|
| Scalability | X.X | [Severity] | [One-line finding] |
| Reliability | X.X | [Severity] | [One-line finding] |
| Maintainability | X.X | [Severity] | [One-line finding] |
| Security | X.X | [Severity] | [One-line finding] |
| Observability | X.X | [Severity] | [One-line finding] |
| Performance | X.X | [Severity] | [One-line finding] |
| Extensibility | X.X | [Severity] | [One-line finding] |
| **Overall** | **X.X** | | |
### Finding Summary
| Severity | Count | Key Items |
|---|---|---|
| Deal Breaker | X | [Brief descriptions] |
| Significant Risk | X | [Brief descriptions] |
| Manageable | X | [Brief descriptions] |
| Non-Issue | X | -- |
### Estimated Remediation Investment: $[X]M - $[Y]M (18 months)
## 1. System Decomposition
**Total Components**: [X] | **Critical Dependencies**: [X] | **Single Points of Failure**: [X]
[Component inventory and dependency map]
## 2. Scalability Assessment
**Current Headroom**: [X]x before bottleneck
**Scaling Model**: [Vertical / Horizontal / Auto / Mixed]
[Capacity analysis and bottleneck identification]
## 3. Reliability Assessment
**Achieved Uptime**: [99.X%] vs. Target [99.X%]
**RPO**: [X] | **RTO**: [X] | **Last DR Test**: [Date]
[Failure mode analysis, DR assessment]
## 4. Security Architecture
**Defense in Depth Score**: [X.X]/5.0
**Zero Trust Maturity**: [X.X]/5.0
[Threat model findings, compliance gaps]
## 5. Maintainability & Technical Debt
**Tech Debt Ratio**: [X]% | **Estimated Remediation**: $[X]M
[Debt inventory, prioritized remediation]
## 6. Observability
**Observability Score**: [X.X]/5.0
[Gaps in logging, metrics, tracing, alerting]
## 7. Performance
**P95 Response Time**: [X]ms vs. Target [X]ms
[Performance analysis, anti-patterns identified]
## 8. Cloud Architecture
**Well-Architected Alignment**: [X.X]/5.0
[Anti-patterns identified, optimization opportunities]
## 9. Modernization Assessment
**Modernization Urgency**: [X.X]/5.0
**Recommended Strategy**: [Strangler Fig / Incremental / Platform Migration / Encapsulate]
[Modernization pathway with rationale]
## Technical Debt Register
| ID | Debt Item | Category | Impact | Effort | Debt ROI | Priority |
|---|---|---|---|---|---|---|
| TD-001 | [Description] | [Category] | [1-5] | [Person-days] | [X.X] | [Immediate/Q+1/Defer] |
## Remediation Roadmap
### Phase 1: Stabilize (Months 1-3) -- Budget: $[X]
| Initiative | Finding | Effort | Cost | Expected Outcome |
|---|---|---|---|---|
| [Initiative] | [Ref #] | [Person-weeks] | $[X] | [Measurable outcome] |
### Phase 2: Strengthen (Months 3-9) -- Budget: $[X]
| Initiative | Finding | Effort | Cost | Expected Outcome |
|---|---|---|---|---|
### Phase 3: Scale (Months 9-18) -- Budget: $[X]
| Initiative | Finding | Effort | Cost | Expected Outcome |
|---|---|---|---|---|
## Architecture Decision Recommendations
### ADR-001: [Decision Title]
**Recommendation**: [Proposed change]
**Rationale**: [Why this decision, what alternatives were considered]
**Trade-offs**: [What we give up]
## Appendix
- Component inventory (full)
- Dependency map (diagram)
- Performance test results
- Security scan results
Quality Checks
- All seven "-ilities" assessed with individual scores and evidence.
- System decomposition includes component inventory and dependency mapping.
- Scalability assessed at 2x, 5x, and 10x current load with bottleneck identification.
- Reliability analysis includes SLA/SLO evaluation, failure modes, and DR assessment.
- Security review covers defense in depth, zero trust, and threat modeling (STRIDE).
- Technical debt classified by category and type, quantified in dollar terms.
- Debt ROI calculated to prioritize remediation investments.
- API and integration architecture reviewed with pattern recommendations.
- Cloud architecture assessed against well-architected framework.
- Performance anti-patterns identified with specific remediation guidance.
- Modernization pathway recommended with decision framework scoring.
- ADR framework provided for ongoing architecture governance.
- Remediation roadmap phased (Stabilize / Strengthen / Scale) with budget allocation.
- All findings classified as Deal Breaker / Significant Risk / Manageable / Non-Issue.
- Technology maturity model (1-5) applied consistently across all dimensions.
- Cost estimates provided as ranges with confidence levels.