name: review-architecture description: Performs an architectural review of a microservice-based system built on the Microbus framework. Examines service boundaries, dependencies, coupling, API design, resilience, data ownership, agentic workflow composition, observability, security, and operational concerns. Produces a structured report with findings and recommendations.
Workflow
Copy this checklist and track your progress:
Architectural review:
- [ ] Step 1: Build the system map
- [ ] Step 2: Service boundaries and design
- [ ] Step 3: Dependencies and coupling
- [ ] Step 4: API design
- [ ] Step 5: Resilience and reliability
- [ ] Step 6: Data and consistency
- [ ] Step 7: Agentic workflows
- [ ] Step 8: Observability
- [ ] Step 9: Security
- [ ] Step 10: Operational concerns
- [ ] Step 11: Documentation
- [ ] Step 12: Produce the report
Step 1: Build the System Map
Read main/main.go to identify all microservices included in the application and their startup group ordering. Read the manifest.yaml and CLAUDE.md of each microservice to understand its features, downstream dependencies, design rationale, and general properties. Read config.yaml for runtime configuration context. If main/topology.mmd exists, read it to visualize the dependency graph.
Produce a written inventory: for each microservice, note its hostname, purpose (from description), downstream dependencies, outbound events, inbound event sinks, configs, db and cloud properties, and startup group. This inventory is the foundation for all subsequent steps.
Step 2: Service Boundaries and Design
Evaluate each microservice for:
- Separation of concerns - does each service own a single, well-defined domain? Flag services whose description suggests multiple unrelated responsibilities.
- Right-sized services - are services too granular (nano-services adding unnecessary network hops) or too broad (distributed monolith with many unrelated endpoints)?
- Encapsulated persistence - does any service access another service's database, share tables via foreign keys or JOINs, or make assumptions about another service's schema? Grep for SQL statements and check import paths to verify.
- Cohesion - is related functionality grouped together, or scattered across services? Are there endpoints that would be better placed in a different microservice?
- Hostname conventions - do hostnames follow the framework's dotted lowercase convention? Are core services on
.coreand application services on an appropriate domain? - Encapsulated external APIs - is each external API (third-party web service, cloud provider, etc.) wrapped in its own dedicated microservice, as indicated by the
cloudproperty in the manifest? Flag upstream services that call external APIs directly (grep forhttp.NewRequestorhttp.Getoutside of a wrapping microservice) rather than going through an encapsulating microservice and the HTTP egress proxy. - Externalized strings - are user-facing strings (error messages shown to end users, UI labels, etc.) externalized to
resources/text.yamland loaded viasvc.LoadResString, rather than hardcoded in the source? Internal error messages (for logging or developer consumption) need not be externalized.
Step 3: Dependencies and Coupling
Analyze the dependency graph from the downstream sections of each manifest.yaml, and verify against actual code:
- No circular dependencies - is the dependency graph a DAG? Trace the graph from the inventory built in Step 1. Flag any cycles, including indirect cycles (A depends on B depends on C depends on A).
- Loose coupling - do services interact only through their
*apiclient stubs? Grep import statements in each microservice for imports of other microservices' packages. Flag any import of a sibling service's internal package (anything other than its*apipackage). - Fan-out - does any single request cascade through many services? Trace call chains from ingress-facing endpoints. The framework enforces a max call depth of 64, but even chains deeper than 4-5 levels warrant scrutiny for latency and fragility.
- Dependency direction - do higher-level, domain-specific services depend on lower-level, generic services - not the reverse? Flag infrastructure or utility services that depend on domain services.
- Event pairing - does every inbound event sink reference a valid outbound event from the source service? Check that the
sourcein eachinboundEventsentry matches an actualoutboundEventsentry in the source service's manifest. - Appropriate use of events - are events used for loose coupling where appropriate (e.g., notifications, cache invalidation, cross-domain reactions), rather than forcing direct downstream dependencies? Conversely, are events misused where a direct call would be simpler and more reliable (e.g., request/response flows shoehorned into events)?
- Manifest accuracy - does the
downstreamsection of each manifest match the actual*apiimports in the code? Flag missing or stale entries in either direction.
Step 4: API Design
Examine the functions, webs, and outboundEvents sections of each manifest.yaml, and review the corresponding handler signatures in service.go:
- Consistent contracts - are naming, HTTP methods, routes, and error handling patterns uniform across services? Flag inconsistencies in conventions (e.g., one service uses
Loadwhile another usesGetfor the same semantic operation). - Appropriate endpoint type - are functional endpoints used for typed request/response RPCs and web handlers used for HTML/file/streaming responses? Flag functional endpoints that should be web handlers (e.g., file downloads) or vice versa (e.g., web handlers that just marshal JSON).
- Chatty APIs - are there 1+N call patterns where a caller loops over singular operations instead of using a bulk endpoint? Read
service.goof upstream services to check how they call downstream clients. Flag opportunities for bulk operations. - Route conventions - do routes follow kebab-case? Do routes match their handler names (e.g.,
MyHandlermaps to/my-handler)? Are port numbers used appropriately (:443default for HTTPS,:888for control endpoints,:417for events,:0for any-port)? - Method usage - are HTTP methods used correctly (GET for reads, POST for creates/actions, PUT for full updates, DELETE for deletes)? Is
ANYoverused where a specific method would be more precise? - Load balancing - is the
loadBalancingsetting appropriate for each endpoint? Should any unicast endpoint be multicast (none) or vice versa? Are there endpoints that should use a named queue for sticky routing? - Clean URLs - are
//route overrides used appropriately for user-facing pages, and not overused for internal APIs where the hostname prefix provides useful namespacing? - Function signatures - are function signatures clean? Flag functions with excessive parameters that should use a struct, or functions that return too many values.
- Error semantics - are HTTP status codes used consistently and correctly across services? Is 404 always "not found"? Are 4xx (client errors) vs 5xx (server errors) distinguished properly? In Microbus, status codes are attached to errors via
errors.Neworerrors.Trace- flag handlers that return generic 500s for what should be 400-level errors. - Pagination - do
List-style endpoints support pagination (limit/offset or cursor-based)? Flag endpoints that could return unbounded result sets without any paging mechanism. - Backward compatibility - if a function signature has recently changed, will existing callers break? Flag signature changes that remove or reorder parameters without a migration path for consumers.
Step 5: Resilience and Reliability
Review service.go of each microservice:
- Time budget awareness - do long-running operations check the context for cancellation? Are there operations that could exceed the default 20-second time budget without using
pub.Timeout? Flag any downstream calls in a loop without checkingctx.Err(). - Failure isolation - when a downstream call fails, is the error handled gracefully or does it propagate unchecked and take down the caller's operation? Are there fallback paths for non-critical downstream failures?
- Safe concurrency - grep for bare
goandgo funcstatements inservice.gofiles. All goroutines must be launched viasvc.Go(which handles panic recovery and lifecycle) rather than baregostatements. Issvc.Parallelused where multiple independent operations could run concurrently? - Idempotency - are mutating operations safe to retry? This is especially important in Microbus because the framework may retry requests that fail due to ack timeout, meaning a handler could be invoked more than once for the same logical request.
- Graceful degradation - do services handle partial failures from downstream dependencies? For multicast calls (iterating over responses from
NewMulticastClient), is the zero-response case handled appropriately? - Blocking operations - are there synchronous operations in handlers that could block for extended periods without respecting context cancellation (e.g., file I/O, network calls without timeout, channel operations without select)?
- Resource exhaustion - are there maps, slices, or channels in the
Servicestruct that grow without bound? Flag handlers that accumulate state without eviction or size limits.DistribCacheandlru.Cachehave built-in limits, but hand-rolled maps and slices do not.
Step 6: Data and Consistency
Evaluate data ownership and consistency strategies:
- Ownership clarity - does each piece of data have exactly one authoritative source (one microservice that owns the writes)?
- Consistency strategy - is there an explicit choice between strong and eventual consistency for each cross-service data flow? Flag cases where the consistency model is ambiguous or implicit.
- Cross-service data - when services need data owned by others, is the strategy appropriate? The three strategies are: on-demand querying via client stubs (strong consistency, higher latency), caching via
DistribCachewith event-driven invalidation (eventual consistency, volatile storage), or denormalization into local tables (eventual consistency, durable storage). Flag mismatches between the chosen strategy and the consistency requirements. - Cache discipline - if
DistribCacheis used, areMaxAgeandMaxMemoryconfigured inOnStartup? Is the cache used only for data that can tolerate loss and staleness? The framework explicitly warns against usingDistribCacheto share state between peers - flag any such misuse. - No distributed transactions - are there implicit distributed transactions where multiple services are mutated in sequence without compensation or saga logic? Flag operations that partially succeed across services with no rollback path.
- Event ordering - do event sinks assume events arrive in a particular order? In a distributed system, events can arrive out of order or be delivered more than once. Flag handlers that depend on sequencing (e.g., "created" before "updated") without explicit ordering logic or idempotency guards.
- Schema migration safety - for SQL CRUD services, are migration scripts additive and non-destructive (e.g.,
ADD COLUMN, notDROP COLUMNwithout a data migration step)? Are migrations compatible across all three supported database drivers (mysql,pgx,mssql)? Check for driver-specific statements prefixed with-- DRIVER:.
Step 7: Agentic Workflows
Skip this step if no microservice has tasks or workflows in its manifest.yaml. Otherwise, for each microservice that defines a workflow graph or task endpoint:
- No foreman pass-through endpoints - the workflow microservice must not expose
Run<Workflow>,Start<Workflow>,Cancel<Workflow>,<Workflow>Status, or any other wrapper aroundforemanapi.Create/Start/Cancel/Snapshot/Await/Resume/Continue. Grepservice.goforforemanapi.NewClientinside the microservice that owns the workflow - if it appears, the microservice is mediating its own launch and the upstream caller should be talking to the foreman directly instead. The only legitimateforemanapicallers are the actual launchers of a flow (a UI, a webhook handler, a ticker on a different microservice), not the workflow's own host. - Sub-workflow invocation uses subgraphs, not
foremanapi.Run- grep task bodies forforemanapi.NewClient(svc).Runor.Create. A task that runs another workflow this way spawns an independent flow and forfeits re-entry, cascading cancel, single-flow audit trail, and interrupt propagation. The correct shape is a task that callsflow.Subgraph(otherflowapi.OtherFlow.URL(), input), returns nil whileyield=true, and adopts the result fromoutonce it resolves. - State hygiene at subgraph boundaries - if a parent and a subgraph use different field names or one curates state at the boundary, look for a small upstream task that reshapes state with
flow.Delete/flow.Clear/flow.Transform, or a small downstream task that drops internal scratch the parent doesn't want. A subgraph node with neither and a parent that obviously needs to reshape state (e.g., the parent storesuserQuestionbut the subgraph readsprompt) is a smell. The fix is one task at the boundary, not in-task state mutation scattered through the graph. - Reducer correctness at fan-in - for every
graph.SetFanIn(node), check that every state field a branch writes is either replace-safe (last write wins is intended) or has an explicitgraph.SetReducer(field, reducer)declared. The default is Replace; a branch that writes to amessagesfield expecting accumulation will silently lose siblings' contributions if no reducer is registered. Match the reducer to the semantic:Appendfor ordered arrays of deltas,Addfor numeric sums,Unionfor sets,Mergefor objects,And/Orfor booleans,Concatfor strings. - Task idempotency - tasks may execute more than once for the same logical step:
flow.Retry, foreman lease recovery, subgraph re-entry after the child completes, manualRetryvia the foreman API. Tasks that fire external side effects (charging a card, sending an email, mutating a non-transactional store) must carry their own dedupe key or check first whether the effect has already happened. Pure-state tasks need no special treatment. Flag any task body that performs an external side effect without an obvious idempotency guard. - Long-running tasks free the worker - any task that intrinsically takes more than the foreman's
TimeBudgetceiling (default 2m, max 15m) must use the Interrupt/Resume pattern or the Sleep/Retry polling pattern, never block insidetime.Sleepor a synchronous external wait. Grep tasks for long blocking calls without checking the return ofsvc.Sleepor for absence offlow.Retry/flow.Interruptin polling shapes. - LLM tool exposure shape - if the workflow uses
llmapi.ChatLoopas a subgraph, the prepare task should producetoolURLs []stringof canonical endpoint URLs (calculatorapi.Arithmetic.URL()style), not a pre-built[]llmapi.Tool. Pre-resolvingToolschemas in the caller duplicates whatInitChatdoes internally and bypasses the actor-aware OpenAPI fetch that gates tools byrequiredClaims. Flag any prepare task that constructsllmapi.Toolliterals. - Naming convention - microservices whose sole purpose is to host one workflow are conventionally named
<role>.agent(e.g.planner.agent,coordinator.agent). Not a framework requirement, but flag drift from the convention if a project has adopted it elsewhere.
If the application includes a workflow that calls an LLM, also verify that foreman.core, llm.core, and an LLM provider (e.g. claudellm.core) are added to main/main.go. The workflow compiles fine without them, but execution fails at runtime with an ack timeout on whichever endpoint is missing.
Step 8: Observability
Check for adequate instrumentation by reviewing manifest.yaml metrics sections and service.go:
- Metrics coverage - are key business operations instrumented with custom metrics (counters, gauges, or histograms) defined in the manifest? Flag important operations (payments, user actions, external API calls) that lack metrics.
- Dead metrics - are metrics defined in the manifest actually recorded in the code? Grep for each metric's
IncrementorRecordfunction inservice.go. Flag metrics that are defined but never recorded. - Histogram buckets - are
bucketsappropriate for histogram metrics? Flag buckets that don't cover the expected range of values (e.g., latency buckets that max out at 10ms for an operation that routinely takes seconds). - Logging - is
svc.LogInfo/Warn/Error/Debugused with structuredslog-style name=value pairs? Is the"error"label used when logging errors? Grep forfmt.Print,log.Print, orprintlnto flag non-framework logging. - Context propagation - is
ctxpassed through the entire call chain, from handler to downstream calls? Flag functions that create a newcontext.Background()orcontext.TODO()instead of propagating the received context, as this breaks distributed tracing. - Error tracing - are all returned errors wrapped with
errors.Trace? Are new errors created witherrors.New(notfmt.Errorf)? Grep forreturn err(withouterrors.Trace) andfmt.Errorfto find violations.
Step 9: Security
Review authentication, authorization, and secrets management:
- Authentication at the edge - is the HTTP ingress proxy configured with authorization middleware to validate and exchange tokens? Check for the authorization middleware in the HTTP ingress setup.
- Authorization per endpoint - do endpoints that handle sensitive operations (mutations, private data access, admin functions) specify
requiredClaimsin their manifest? Flag sensitive endpoints withoutrequiredClaims. - Stale issuer predicates - flag
iss=~"access.token.core"(oriss==) predicates insiderequiredClaimsas redundant. Issuer verification is enforced at the framework layer via JWKS pinning - the connector rejects any token whoseissdoes not match a framework-pinned hostname before signature verification. The predicate is no longer load-bearing; recommend removal. - Claims expression correctness - are
requiredClaimsboolean expressions syntactically correct and logically sound? Check for proper use of&&,||,!, comparison operators, and regex matching. Flag overly permissive expressions (e.g.,role=="admin" || true). - Input validation - are user-provided inputs validated before processing? Flag endpoints that pass user input directly into SQL queries (injection risk), file paths (traversal risk), or HTML output (XSS risk) without sanitization.
- Secrets management - are secret config properties marked
secret: truein the manifest? Are secret values placed inconfig.local.yaml(git-ignored) and not inconfig.yamlor hardcoded in source? Grep for patterns that look like hardcoded credentials (API keys, passwords, connection strings).
Step 10: Operational Concerns
Evaluate deployment, configuration, and production readiness:
- Startup group ordering - in
main/main.go, are microservices added to the application in the correct startup group order? Services must start after their dependencies. Verify against the dependency graph from Step 3. - Configuration externalization - is behavior tunable via config properties without redeployment? Are config properties defined with appropriate
validationrules in the manifest? - Appropriate defaults - do config properties have sensible defaults? Flag configs with no default that could cause startup failures or unexpected behavior.
- Resource lifecycle - are resources (database connections, caches, external clients) initialized in
OnStartupand cleaned up inOnShutdown? Flag resources that are initialized inline (e.g., in a handler) without cleanup. - Ticker intervals - are ticker intervals appropriate for their purpose? Flag tickers that run too frequently (sub-second) or too infrequently for their stated goal.
- Embedded resources - are static files properly placed in
resources/and accessed viasvc.ReadResFileorsvc.ResFS? - Test coverage - does each microservice have integration tests in
service_test.go? Do tests useapplication.New()withRunInTest(t)? Are mocks provided for downstream dependencies that are unavailable in the test environment? - No TODO leftovers - grep for
TODO,FIXME,HACK, andXXXcomments across the codebase. Flag any that indicate unfinished work or known issues that should be resolved before production. - Version tracking - is the
Versionconst in each microservice's<name>api/definition.gobeing incremented when changes are made?
Step 11: Documentation
Evaluate whether the system is sufficiently documented for a new developer or agent to understand and safely evolve it:
- CLAUDE.md quality - does each microservice have an
CLAUDE.mdthat explains its purpose, design rationale, tradeoffs, and the context needed to evolve it? Flag microservices with missing, empty, or boilerplate-onlyCLAUDE.mdfiles. The file should capture why decisions were made, not just describe what the code does. - Manifest completeness - does each
manifest.yamlhave meaningfuldescriptionfields for the service, its endpoints, configs, events, and tickers? Flag entries with missing or placeholder descriptions. - Godoc comments - are exported types, functions, and methods in
service.goand*api/client.godocumented with Godoc comments? Flag exported symbols that lack comments, especially public API surfaces in*apipackages that downstream consumers rely on. - Complex logic - are non-obvious algorithms, business rules, or workarounds explained with inline comments? Flag complex code blocks (deeply nested logic, regex patterns, SQL queries with many joins/conditions) that lack explanatory comments.
- Config documentation - are config properties documented with clear descriptions in the manifest? Would a system operator understand what each config does, what values are valid, and what the default means?
Step 12: Produce the Report
Compile findings into a structured report. For each category (Steps 2-11), list:
- Findings - specific observations, both positive and negative, with file paths and line numbers where applicable
- Recommendations - actionable suggestions for improvement, ordered by severity
Use severity levels:
- Critical - architectural issues that could cause system-wide failures, data loss, or security vulnerabilities
- Warning - design concerns that may lead to problems at scale or during evolution
- Info - minor improvements or deviations from best practices
Omit categories where no issues were found, but mention them in the summary as areas that passed review.
Format the report as follows:
# Architectural Review
## Summary
Brief overview of the system, number of microservices reviewed, and high-level assessment.
Note categories that passed review without issues.
## Service Boundaries and Design
### Findings
- ...
### Recommendations
- [severity] ...
## Dependencies and Coupling
### Findings
- ...
### Recommendations
- [severity] ...
(repeat for each category that has findings)
## Conclusion
Overall assessment and prioritized action items.
Present the report to the user.