name: rfc description: > Draft, review, and iterate on RFC (Request for Comments) documents for infrastructure, reliability, and platform changes. Use when proposing a significant technical change, new architecture, major process shift, or any decision that needs cross-team alignment before implementation. Covers problem framing, solution options, trade-off analysis, risk assessment, and rollout plan. Trigger keywords: RFC, request for comments, proposal, design doc, design document, architecture proposal, ADR, architecture decision record, technical proposal, propose a change, write a proposal, get alignment, we need to decide, compare options, technical decision, platform change, migration plan, new architecture. allowed-tools: Read Glob Grep Bash(git log:*)
RFC Skill
Setup Check
Before loading context files, check if context/CONTEXT.md exists in the current directory.
If context/CONTEXT.md exists — read it and proceed normally.
If context/CONTEXT.md does not exist — this skill was installed standalone (e.g. via npx skills add). Ask the user these questions before proceeding:
- Role —
junior-sre/senior-sre/sre-manager(shapes output depth and tone) - Cloud provider —
aws/gcp/azure/on-prem/hybrid - Observability stack — e.g. Datadog, Prometheus+Grafana, New Relic
- Company name and primary services affected (if relevant to this task)
Use the answers inline for this session. For persistent setup across all skills, suggest:
pipx install sre-agent
sre-agent init
Instructions
Step 1: Load Context
Read context/CONTEXT.md and context/company/tech-stack.md.
Persona adjustments:
- junior-sre: provide extra guidance on each RFC section, explain why it matters.
- senior-sre: focus on trade-off depth and risk analysis.
- sre-manager: emphasize business impact, team bandwidth, and decision timeline.
Step 2: Determine the RFC Type
Classify the RFC scope to calibrate how formal and detailed it needs to be:
| Type | Scope | Typical audience | Approval needed |
|---|---|---|---|
| Lightweight | Single-service change, low risk | SRE team | Team lead |
| Standard | Multi-service or platform-wide | Engineering teams | Staff engineer + manager |
| Major | Company-wide infra, security posture, data architecture | All engineering + leadership | VP Engineering / CTO |
Step 3: Gather RFC Inputs
Ask the user for:
- Problem statement — what is broken, missing, or needs to improve? Why now?
- Proposed solution — what is the leading idea (even rough)?
- Alternatives considered — what other options exist or have been ruled out?
- Constraints — budget, timeline, team bandwidth, compliance, existing contracts
- Who needs to weigh in — which teams or individuals are stakeholders?
Step 4: Draft the RFC
Populate the full RFC document using the structure below.
RFC Structure
# RFC: <Title>
**RFC Number:** RFC-<NNNN>
**Status:** Draft | Under Review | Accepted | Rejected | Superseded
**Author(s):** <role or name>
**Created:** YYYY-MM-DD
**Last Updated:** YYYY-MM-DD
**Review Deadline:** YYYY-MM-DD
**Stakeholders:** <teams or roles that must review>
**Approvers:** <who has final say>
## Summary
One paragraph. What are we changing, why, and what is the expected outcome? Readers should understand the full proposal after reading only this section.
## Problem Statement
Describe the current state and why it is a problem. Be specific:
- What is the measurable impact of the current situation? (latency numbers, incident count, engineering hours lost, cost, customer complaints)
- Who is affected?
- Why is this the right time to address it?
Avoid: "our system is slow." Use: "P99 latency for the order-service is 2.3s against a 500ms SLO target, causing error budget exhaustion 3 months in a row and 4 P1 incidents in Q4."
## Goals
Bullet list of what success looks like. Each goal should be measurable:
- ✅ Reduce order-service P99 latency from 2.3s to < 500ms
- ✅ Eliminate connection pool exhaustion as a failure mode
- ✅ No additional headcount required for implementation
Non-goals (explicitly out of scope for this RFC):
- ❌ Migrating the legacy reporting pipeline (separate RFC)
- ❌ Changes to the authentication service
## Proposed Solution
Describe the chosen approach in enough detail that an engineer can evaluate it.
Include:
- Architecture diagram (ASCII or description)
- Key components and how they interact
- Data flow changes
- API or interface changes
- Configuration or infrastructure changes
Implementation phases (break into reviewable chunks):
- Phase 1: [what, timeline, who]
- Phase 2: [what, timeline, who]
- Phase 3: [what, timeline, who]
## Alternative Solutions
For each alternative, explain:
- What it is
- Why it was considered
- Why it was not chosen (the key reason)
| Alternative | Pros | Cons | Reason rejected |
|---|---|---|---|
| Option A | |||
| Option B | |||
| Do nothing |
The "do nothing" option must always be included with an honest assessment of its cost.
## Trade-offs and Risks
Trade-offs
What are we giving up or accepting as a cost of this approach?
- Performance vs. consistency
- Operational simplicity vs. feature richness
- Cost vs. reliability
- Speed of delivery vs. safety
Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Migration causes data loss | Low | High | Dual-write with validation before cutover |
| Adoption slow across teams | Medium | Medium | Provide migration tooling and office hours |
| Cost higher than estimated | Medium | Low | Phase rollout, monitor cost per phase |
Reliability Impact
- Does this change affect any SLO?
- What is the failure mode if this goes wrong during rollout?
- What is the rollback plan?
## Rollout Plan
How will this change be deployed safely?
- Feature flagged? (yes / no)
- Canary rollout? (yes / no — what % and what duration)
- Rollback procedure: [describe]
- Validation criteria before proceeding to next phase: [describe]
- Who owns the rollout?
Rollback trigger conditions:
- [metric threshold that triggers rollback]
## Operational Impact
- On-call burden: does this increase or decrease toil? By how much?
- Runbook changes needed: [list]
- Alert changes needed: [list]
- Dashboard changes needed: [list]
- SLO changes needed: [list]
## Resource Requirements
| Resource | Estimate | Notes |
|---|---|---|
| Engineering time | X weeks | Y engineers |
| Infrastructure cost delta | +/- $X/month | |
| External dependencies | [list] | |
| Timeline | Start: YYYY-MM-DD | Done: YYYY-MM-DD |
## Open Questions
List questions that are not yet resolved and who needs to answer them:
- Q: Should we use X or Y for the queue? Owner: @platform-team Due: YYYY-MM-DD
- Q: What is the data retention requirement? Owner: @legal Due: YYYY-MM-DD
## Decision
Fill in after review period closes.
Decision: Accepted / Rejected / Accepted with modifications
Rationale: [why]
Conditions / modifications: [if any]
Decided by: [approvers] on YYYY-MM-DD
## References
- Related RFCs: [links]
- Related incidents/postmortems: [links]
- External references: [links]
Step 5: Review Facilitation
After drafting, suggest:
- Who should review (based on stakeholders named above)
- A review deadline (typically 5–10 business days for standard, 2 weeks for major)
- The review format: async comments, or a synchronous review meeting?
For major RFCs, recommend a structured review meeting with:
- Presenting author walks through problem + proposed solution (15 min)
- Open Q&A and challenge session (30 min)
- Decision or next steps (15 min)
Step 6: Iteration Support
If the user shares feedback or comments from reviewers:
- Summarize the key objections
- Suggest how to address each objection
- Identify which objections should change the proposal vs. which should be documented as trade-off acknowledgments
Examples
Example: RFC for migrating from polling to event-driven architecture
Problem: Polling-based job processor adds 10–30s latency to user notifications and consumes 40% of DB read capacity on polling queries. SLO for notification delivery (< 5s) is breached 12 days per month.
Goal: Deliver notifications within 5s of trigger event with < 0.1% loss.
Proposed solution: Replace polling with SQS FIFO queue. Job producers publish events. Notification service consumes from queue. Estimated DB read reduction: 60%.
Alternatives:
- WebSockets: rejected — requires stateful connection management, high ops cost
- Reduce polling interval: rejected — increases DB load further, does not meet SLO
- Do nothing: rejected — SLO breach is accelerating, impacting customer retention
Guidelines
- The problem statement must include data, not just complaints.
- Every RFC must have a "do nothing" alternative with an honest cost.
- Open questions should be tracked with owners and due dates — not left vague.
- An RFC is not a design spec — it is a decision document. Enough detail to evaluate, not enough detail to implement without further work.
- Persona (sre-manager): always include a one-paragraph executive summary and a resource requirements table. Flag if the RFC requires headcount or budget approval.
- RFCs should be version-controlled. Suggest the file path:
docs/rfcs/RFC-<NNNN>-<short-title>.md