2025-09-08
The Anatomy of a Good Technical RFC: Section-by-Section Breakdown
A guide to crafting technical RFCs that actually get approved and drive successful implementations, based on reviewing hundreds of documents
Most RFCs written for critical systems go unread past the first paragraph. What separates a useful document from a bureaucratic exercise is structure, starting with a clear problem statement. Reviewing hundreds of RFCs across multiple companies surfaces the same patterns again and again.
The best RFCs aren’t written by the most senior architects. They’re written by engineers who understand that an RFC is fundamentally a sales document: selling a solution to multiple audiences with competing priorities. And like any good sales pitch, structure matters as much as content.
The RFC That Reframes the Problem
Consider a junior engineer’s notification system RFC that gets approved in record time while a senior architect’s more technically sophisticated proposal sits in review for months. The difference: the junior engineer understood the audience and structured the RFC to answer questions in the order stakeholders actually ask them.
Let’s dissect a real notification system RFC section by section, examining why each part works and what reviewers actually look for. This isn’t theoretical - this RFC led to a production system handling millions of notifications daily, and the implementation journey revealed which sections proved most valuable.
Executive Summary: The 30-Second Pitch
The executive summary is your elevator pitch. You have about 30 seconds to convince a busy VP or senior engineer that this document is worth their time. Here’s what works:
What Actually Works
We need to implement a robust, scalable user notification system that can handle
real-time updates, push notifications, email notifications, and in-app notifications
across our platform. This system will serve as the backbone for user engagement,
critical alerts, and feature announcements.
This summary works because it:
- States the what clearly (notification system)
- Lists specific capabilities (real-time, push, email, in-app)
- Connects to business value (user engagement, critical alerts)
- Avoids technical jargon
Common Mistakes
Weak version:
This RFC proposes implementing a microservices-based event-driven architecture
utilizing Kafka, PostgreSQL, and WebSockets to facilitate asynchronous message
delivery across multiple channels with configurable retry mechanisms.
The weak version loses executives at “microservices-based” and never explains why anyone should care. If a system can’t be explained to a product manager in one paragraph, the design probably needs more clarity.
Insider Tips
What reviewers actually look for:
- Scope clarity: Is this a complete rewrite or an enhancement?
- Business alignment: Does this solve a real problem or is it resume-driven development?
- Risk assessment: Are you being honest about complexity?
The notification RFC nailed this by focusing on user impact first, technology second. That clarity helps maintain focus during the inevitable scope creep discussions.
Problem Statement: Quantifying the Pain
The problem statement is where you build urgency. Numbers matter here - vague problems get vague timelines.
Effective Problem Framing
The notification RFC tied each pain point to its business impact:
### Current Pain Points
- Users miss important updates about their projects
- No centralized way to manage notification preferences
- Manual notification sending is error-prone and not scalable
### Business Impact
- Reduced user engagement and retention
- Increased support tickets due to missed communications
- Poor user experience leading to churn
Notice how each pain point maps to measurable business impact. Tracking these exact metrics during implementation shows support tickets dropping by 23% within three months.
Weak Problem Statements
Here’s what doesn’t work:
The current system is outdated and difficult to maintain. Engineers complain
about the codebase and adding new features is challenging.
This tells me nothing actionable. How outdated? What specific maintenance issues? Which features are blocked? Without specifics, this reads like every legacy system ever.
The Data That Matters
Strong RFCs include:
- Current metrics: “847 support tickets last month about missed notifications”
- Cost implications: “Engineers spend 15% of sprint time on manual notification tasks”
- Opportunity cost: “Three feature launches delayed due to notification limitations”
Reviewing metrics six months post-implementation confirms why this data matters: the problems quantified upfront become the success metrics. The RFC essentially writes its own success criteria.
Proposed Solution: Balancing Vision and Specificity
This is where most RFCs go off the rails. Engineers either get lost in implementation details or stay so high-level that nobody knows what’s actually being built.
The Goldilocks Zone
The notification RFC found the perfect balance:
### System Architecture
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│  Notification   │    │   Notification   │    │  Notification   │
│    Sources      │───▶│     Engine       │───▶│    Channels     │
└─────────────────┘    └──────────────────┘    └─────────────────┘
### Core Components
- Event Processor: Handles incoming notification events
- Template Engine: Manages notification templates and personalization
- Rate Limiting: Prevents notification spam
This works because it:
- Shows the big picture architecture visually
- Breaks down into understandable components
- Explains what each component does, not how
Over-Engineering Red Flags
Watch out for:
- Solutions looking for problems (“We’ll use GraphQL subscriptions because they’re modern”)
- Technology bingo (“Kubernetes, Istio, Envoy, Linkerd…”)
- Premature optimization (“We’ll shard the database from day one”)
In practice, implementations often start simpler than the RFC suggests. The modular design allows adding complexity gradually: rate limiting in month three, not day one.
Technical Implementation: Where Rubber Meets Road
This section separates the dreamers from the builders. Good technical specs are concrete enough to estimate but flexible enough to adapt.
Database Schema That Survived Production
The RFC’s database schema mostly survived contact with reality:
CREATE TABLE notification_events (
    id                UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id           UUID REFERENCES users(id) ON DELETE CASCADE,
    notification_type VARCHAR(100) NOT NULL,
    template_id       UUID REFERENCES notification_templates(id),
    data              JSONB DEFAULT '{}',
    status            VARCHAR(20) DEFAULT 'pending',
    sent_at           TIMESTAMP,
    delivered_at      TIMESTAMP,
    read_at           TIMESTAMP,
    created_at        TIMESTAMP DEFAULT NOW()
);
What made this effective:
- Audit trail built-in: sent_at, delivered_at, read_at timestamps
- Flexibility via JSONB: The data field handled unforeseen requirements
- Status tracking: Essential for debugging production issues
What Changed in Production
Production reality adds:
- Index requirements often missed (compound index on user_id + status + created_at)
- Partition strategy for time-series data (monthly partitions)
- Archive strategy (moving old notifications to cold storage)
The production debugging post details how these requirements surface in practice.
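As a rough illustration of the first item, the sketch below builds a cut-down version of the table in an in-memory SQLite database and asks the query planner how it would run a typical "pending notifications for one user" lookup. SQLite stands in for PostgreSQL only so the example runs anywhere; the index name, the trimmed column list, and the query itself are assumptions made for illustration, not details from the RFC.

```python
import sqlite3

# Cut-down stand-in for notification_events. Column types are simplified
# because SQLite is used here instead of PostgreSQL.
SCHEMA = """
CREATE TABLE notification_events (
    id TEXT PRIMARY KEY,
    user_id TEXT NOT NULL,
    notification_type TEXT NOT NULL,
    status TEXT DEFAULT 'pending',
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
-- Hypothetical index name. Column order follows the access pattern:
-- equality on user_id and status, then ordering by created_at.
CREATE INDEX idx_events_user_status_created
    ON notification_events (user_id, status, created_at);
"""

QUERY = """
SELECT id, notification_type, created_at
FROM notification_events
WHERE user_id = ? AND status = ?
ORDER BY created_at DESC
LIMIT 20
"""


def explain(conn: sqlite3.Connection, sql: str, params: tuple) -> list[str]:
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # last column is the readable detail


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
plan = explain(conn, QUERY, ("user-1", "pending"))
print(plan)
```

The plan should mention the compound index rather than a full table scan. On PostgreSQL the equivalent check would be an `EXPLAIN` on the real table with production-like data volumes, since planner choices depend on table statistics.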
API Design That Scales
Good RFCs show representative API endpoints:
POST /api/notifications/send
GET /api/notifications/user/:userId
PUT /api/notifications/:id/read
But great RFCs also consider:
- Pagination strategies for list endpoints
- Batch operations for efficiency
- Versioning strategy for future changes
- Rate limiting at the API level
Cursor-based pagination becomes necessary once offset pagination creates performance issues at scale, something the RFC could have anticipated.
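To make the difference concrete, here is a minimal sketch of cursor (keyset) pagination over a simplified notification_events table, again using in-memory SQLite so it runs standalone. The function name `fetch_page`, the cursor shape, and the sample rows are assumptions for illustration. Instead of skipping N rows with OFFSET, each page resumes from the last row the client saw.

```python
import sqlite3
from typing import Optional

# (created_at, id) of the last row on the previous page
Cursor = tuple[str, str]


def fetch_page(conn, user_id: str, limit: int, cursor: Optional[Cursor] = None):
    """Return (rows, next_cursor), newest first, without using OFFSET."""
    sql = "SELECT id, created_at FROM notification_events WHERE user_id = ?"
    params: list = [user_id]
    if cursor is not None:
        created_at, last_id = cursor
        # Resume strictly after the last row seen; id breaks timestamp ties.
        sql += " AND (created_at < ? OR (created_at = ? AND id < ?))"
        params += [created_at, created_at, last_id]
    sql += " ORDER BY created_at DESC, id DESC LIMIT ?"
    params.append(limit)
    rows = conn.execute(sql, params).fetchall()
    # A short page means there is nothing left to fetch.
    next_cursor = (rows[-1][1], rows[-1][0]) if len(rows) == limit else None
    return rows, next_cursor


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE notification_events (id TEXT PRIMARY KEY, user_id TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO notification_events VALUES (?, ?, ?)",
    [
        ("n1", "u1", "2025-01-01T10:00:00"),
        ("n2", "u1", "2025-01-01T10:05:00"),
        ("n3", "u1", "2025-01-01T10:05:00"),  # same timestamp as n2
        ("n4", "u1", "2025-01-01T10:10:00"),
        ("n5", "u1", "2025-01-01T10:15:00"),
    ],
)

seen, cursor = [], None
while True:
    rows, cursor = fetch_page(conn, "u1", limit=2, cursor=cursor)
    seen += [row[0] for row in rows]
    if cursor is None:
        break
print(seen)  # ['n5', 'n4', 'n3', 'n2', 'n1']
```

Including `id` as a tie-breaker matters: two notifications created in the same instant would otherwise be skipped or repeated across page boundaries. The query also benefits from an index that matches the sort order.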
Implementation Phases: Realistic Timeline Management
This is where optimism meets reality. Every RFC underestimates timelines, but good ones underestimate less catastrophically.
Phases That Made Sense
### Phase 1: Core Infrastructure (Weeks 1-4)
- Database schema implementation
- Basic notification engine
- In-app notification system
### Phase 2: Advanced Features (Weeks 5-8)
- Push notifications
- Template management system
- Scheduling and rate limiting
Why this phasing worked:
- Value delivery in phase 1: The plan put working in-app notifications in front of users by the end of the first phase
- Risk frontloading: Hard problems (real-time delivery) came first
- Learning incorporation: Phase 2 plans adjusted based on phase 1 lessons
Timeline Reality Check
What actually happened:
- Phase 1: 6 weeks (not 4) due to authentication integration complexity
- Phase 2: 10 weeks (not 4) after edge cases in rate limiting surfaced
- Phase 3 (not excerpted above): Partially descoped based on actual usage patterns
The implementation series documents how to adapt the timeline while maintaining stakeholder trust.
Red Flags in Timelines
Watch for:
- No buffer for discoveries (“Week 1: Implement everything”)
- No testing time allocated
- Dependency on other teams not accounted for
- “Simple integration” with external services (it’s never simple)
Technical Considerations: The Reality Check Section
This section reveals whether the authors have actually built similar systems or are just good at reading blog posts.
Scalability That Matters
The RFC got specific about scale:
### Performance Targets
- Notification delivery: < 100ms for in-app, < 5s for email
- System throughput: 10,000+ notifications per second
- Database query performance: < 50ms for preference lookups
These aren’t arbitrary numbers. They’re derived from:
- Current user base (10,000 notifications/second = peak load × 3)
- User experience research (100ms feels instant)
- Infrastructure constraints (database connection limits)
What Actually Got Measured
Six months in production:
- P99 in-app delivery: 87ms
- P99 email delivery: 3.2s
- Peak throughput: 7,800/second (sufficient)
- Database query P99: 124ms (needed optimization)
The analytics and optimization post details how these targets get met.
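For readers who want to run the same comparison on their own latency data, the sketch below computes a nearest-rank percentile and checks measured P99 values against the RFC's targets. The metric keys and function names are hypothetical, and the sample latencies are synthetic; only the target and measured figures come from this post.

```python
import math

# Targets quoted from the RFC excerpt above, in milliseconds.
# The dictionary keys are hypothetical names chosen for this sketch.
TARGETS_MS = {"in_app_delivery": 100, "email_delivery": 5000, "db_query": 50}


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with pct% of data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_targets(p99_ms: dict[str, float]) -> dict[str, bool]:
    """Map each metric to True when its measured P99 is under the target."""
    return {name: p99_ms[name] < target for name, target in TARGETS_MS.items()}


# Synthetic latencies (not real data): 1..200 ms, so P99 is the 198th value.
samples = [float(ms) for ms in range(1, 201)]
print(percentile(samples, 99))  # 198.0

# The six-month P99 figures reported above, converted to milliseconds.
measured = {"in_app_delivery": 87, "email_delivery": 3200, "db_query": 124}
print(check_targets(measured))
```

With the figures from this post, the in-app and email targets pass while the database query target fails, which matches the "needed optimization" note above.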
Security Considerations That Save Your Bacon
Good RFCs address:
- Authentication: “JWT tokens with 15-minute expiry”
- Authorization: “Role-based access with granular permissions”
- Rate limiting: “Per-user limits with exponential backoff”
- Data privacy: “PII encryption at rest, GDPR compliance”
A security incident can surface months in, when someone attempts to use the notification system for spam. A rate limiting strategy specified in the RFC prevents the platform from becoming an unwitting spam relay.
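The RFC excerpt does not say which rate limiting algorithm was used, so treat the following as one common option rather than the actual implementation: a per-user token bucket, which allows short bursts while capping the sustained rate. The class name, parameters, and the injectable clock are assumptions for illustration, and the exponential backoff mentioned above is left out to keep the sketch short.

```python
import time
from typing import Callable


class PerUserRateLimiter:
    """Token bucket per user: permits short bursts, caps the sustained rate."""

    def __init__(self, capacity: int, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = float(capacity)
        self.refill_per_second = refill_per_second
        self.clock = clock  # injectable so behaviour can be tested without sleeping
        # user_id -> (tokens remaining, time of last update)
        self._buckets: dict[str, tuple[float, float]] = {}

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        tokens, last_seen = self._buckets.get(user_id, (self.capacity, now))
        # Refill in proportion to elapsed time, never above capacity.
        tokens = min(self.capacity, tokens + (now - last_seen) * self.refill_per_second)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[user_id] = (tokens, now)
        return allowed


# Demo with a fake clock: a burst of 3 is allowed, then 1 per second.
fake_now = [0.0]
limiter = PerUserRateLimiter(capacity=3, refill_per_second=1.0, clock=lambda: fake_now[0])
burst = [limiter.allow("user-1") for _ in range(5)]
fake_now[0] += 2.0  # two seconds later, two tokens have been refilled
after_wait = [limiter.allow("user-1") for _ in range(3)]
other_user = limiter.allow("user-2")  # buckets are independent per user
print(burst, after_wait, other_user)
```

An in-process dictionary only works for a single instance. A multi-instance deployment would need shared state, which is beyond this sketch.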
Testing Strategy: Beyond “We’ll Write Tests”
Testing sections reveal whether teams actually practice TDD or just talk about it.
Testing That Actually Happened
### Load Tests
- High-volume notification sending
- Concurrent user connections
- Database performance under load
- Queue processing capacity
What made this valuable:
- Specific scenarios: Not just “load testing” but what specifically
- Performance criteria: Clear pass/fail conditions
- Tool selection: k6 for load testing, as the RFC suggested
Testing Gaps That Surfaced
The RFC missed:
- Chaos testing (a Redis failure can surface in month 2)
- Cross-browser WebSocket compatibility
- Mobile app battery impact from persistent connections
- International character set handling in templates
Good RFCs acknowledge that you can’t predict every test scenario but provide a framework for discovering what you missed.
Monitoring & Analytics: What You’ll Actually Look At
Most monitoring sections list every possible metric. Good ones identify the 3-5 metrics that actually indicate system health.
Metrics That Mattered
### Key Metrics
- Delivery success rate (target: 99.9%)
- Delivery time by channel
- User engagement rates
- Support ticket volume
After six months, these four metrics are the ones worth checking daily. Everything else is noise until something breaks.
Alert Fatigue Is Real
The RFC suggested alerting on:
- High error rates (> 5%)
- Delivery delays (> 10s)
- System resource usage (> 80%)
What actually warrants an alert:
- Delivery success rate < 99% (not 95%)
- Email delivery P99 > 30s (not 10s)
- Database connection pool exhaustion (not CPU usage)
The real-time delivery post explains how to tell what actually indicates problems versus normal variance.
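The revised thresholds are simple enough to express as a small rule function. The sketch below is one way to encode them; the `HealthSnapshot` fields, alert names, and sample numbers are hypothetical, and a real setup would more likely live in the monitoring system's own rule format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthSnapshot:
    delivered: int            # notifications delivered in the window
    attempted: int            # notifications attempted in the window
    email_p99_seconds: float  # P99 email delivery time in the window
    db_pool_in_use: int       # connections currently checked out
    db_pool_size: int         # maximum connections in the pool


def active_alerts(s: HealthSnapshot) -> list[str]:
    """Return the names of alerts that should fire for this snapshot."""
    alerts = []
    success_rate = s.delivered / s.attempted if s.attempted else 1.0
    if success_rate < 0.99:
        alerts.append("delivery_success_rate_below_99pct")
    if s.email_p99_seconds > 30:
        alerts.append("email_p99_above_30s")
    if s.db_pool_in_use >= s.db_pool_size:
        alerts.append("db_connection_pool_exhausted")
    return alerts


# A 12s email P99 would have tripped the RFC's original 10s threshold,
# but it stays quiet under the revised rules.
healthy = HealthSnapshot(9950, 10000, 12.0, 18, 20)
degraded = HealthSnapshot(9800, 10000, 45.0, 20, 20)
print(active_alerts(healthy))
print(active_alerts(degraded))
```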
Cost Analysis: The Budget Reality
This is where engineering meets business. Good cost sections acknowledge both immediate and ongoing costs.
Costs We Could Predict
### Infrastructure Costs
- Database: $200-500/month
- Message Queue: $50-150/month
- Push Notification Services: $0.50 per 1000 notifications
These were reasonably accurate because they’re based on published pricing.
Costs That Got Missed
- CloudWatch logs: $300/month (we log everything)
- S3 for notification archive: $150/month
- Additional database read replica: $400/month
- Engineering time for maintenance: 0.5 FTE ongoing
The total monthly cost ended up being ~$1,800 versus the $500-800 implied by the RFC. Still worth it, but stakeholders appreciate honesty about total cost of ownership (TCO).
ROI That Materialized
The RFC projected:
- 20-30% reduction in support tickets
- 5-15% increase in user retention
Actual results:
- 23% reduction in support tickets
- 8% increase in user retention
- Unexpected win: 40% faster feature adoption
Risks & Mitigation: Honest Assessment
The best risk sections admit what the authors don’t know.
Risks That Materialized
Risk: Database performance degradation with high volume
Mitigation: Proper indexing, read replicas, query optimization
This risk absolutely materializes. At 2 million notifications, queries start timing out. The suggested mitigation works, but takes three weeks to implement properly.
Risks That Went Unanticipated
- WebSocket connection limits in the load balancer
- Template rendering performance with nested conditionals
- Time zone edge cases for scheduled notifications
- Mobile carriers blocking the SMS provider
Good RFCs acknowledge unknown unknowns and build in flexibility to handle them.
Success Criteria: Measurable Outcomes
This section is your contract with stakeholders. Make it measurable and realistic.
Criteria That Worked
### Technical Success
- 99.9% notification delivery success rate
- < 100ms in-app notification delivery
- System handles 10,000+ notifications per second
These were:
- Measurable: Specific numbers, not “fast” or “reliable”
- Achievable: Based on similar systems, not wishful thinking
- Relevant: Tied directly to user experience
Moving Goalposts
What changed after launch:
- Success rate target moved to 99.5% (99.9% was too expensive)
- In-app delivery relaxed to 200ms (users couldn’t tell the difference)
- Throughput requirement dropped to 5,000/second (actual peak load)
The key is documenting why criteria changed and getting stakeholder buy-in on adjustments.
The Reviews That Actually Matter
After reviewing hundreds of RFCs, here’s what different stakeholders actually care about:
What VPs/Directors Look For
- Executive summary that explains business value
- Cost analysis with clear ROI
- Timeline with milestone deliverables
- Risk section that doesn’t hide complexity
What Senior Engineers Look For
- Technical implementation that shows deep understanding
- Scalability considerations based on actual metrics
- Alternative approaches and why they were rejected
- Integration points with existing systems
What Team Leads Look For
- Implementation phases that deliver value iteratively
- Testing strategy that’s actually executable
- Success criteria their team can rally around
- Monitoring approach that won’t create alert fatigue
What Security Teams Look For
- Authentication/authorization approach
- Data privacy considerations
- Rate limiting and abuse prevention
- Audit trail capabilities
Lessons From Implementation
Six months after implementing the notification system, these are the areas the RFC should have emphasized more:
Documentation Is Part of the System
The RFC becomes the primary documentation. Structuring it explicitly as living documentation from the start pays dividends.
Migration Strategy Matters
The RFC focused on the new system but barely mentioned migrating from the old one. Migration can take 40% of total effort.
Operational Runbooks Save Lives
The RFC should include or mandate operational runbooks. The gap becomes obvious after the first production incident.
Feature Flags Are Your Friend
The RFC mentioned phased rollout but didn’t emphasize feature flags. They let you switch off a misbehaving feature without a full rollback when issues emerge.
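A percentage-based flag is one simple way to get that kind of control. The sketch below hashes the flag name and user ID into a stable bucket, so a user who is in the rollout at 10% stays in it at 50%, and setting the percentage to 0 acts as a kill switch. The function names and the flag name `notifications_v2` are made up for illustration; a production system would typically read the percentage from a flag service or configuration store.

```python
import hashlib


def bucket_for(flag: str, user_id: str) -> int:
    """Deterministically map (flag, user) to a bucket in 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """True when the user falls inside the rollout percentage for this flag."""
    return bucket_for(flag, user_id) < rollout_percent


users = [f"user-{i}" for i in range(1000)]
at_10 = {u for u in users if is_enabled("notifications_v2", u, 10)}
at_50 = {u for u in users if is_enabled("notifications_v2", u, 50)}

# Widening the rollout only adds users; nobody flips back to the old path.
print(len(at_10), len(at_50), at_10 <= at_50)

# Dropping the percentage to 0 disables the feature for everyone.
print(any(is_enabled("notifications_v2", u, 0) for u in users))
```

Including the flag name in the hash keeps rollouts for different features independent, so the same users are not always the first to receive every new feature.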
The RFC as a Living Document
The best RFCs aren’t abandoned after approval. They evolve into:
- Architecture documentation
- Onboarding materials for new team members
- Decision logs for future reference
- Post-mortem context when things go wrong
A well-maintained notification system RFC can accumulate dozens of commits post-approval, documenting every significant deviation and learning.
What Makes RFCs Actually Useful
Here’s what separates useful RFCs from bureaucratic exercises:
Write for Multiple Audiences
Your RFC has at least four audiences: executives, architects, implementers, and operators. Structure it so each can find what they need quickly.
Be Honest About Uncertainty
The best RFCs include sections titled “What We Don’t Know Yet” or “Assumptions That Might Be Wrong.”
Include Escape Hatches
Good RFCs explain not just how to build the system but how to back out if things go wrong. This paradoxically makes approval easier.
Make Success Measurable
Vague success criteria lead to endless debates. Specific numbers force clarity about what you’re actually trying to achieve.
Show Your Work
Include enough detail that another team could implement your design, but not so much that you’re writing the code in prose.
The Perfect RFC Doesn’t Exist
No RFC is perfect. The notification system RFC dissected in this post had significant gaps: it underestimated complexity, missed operational concerns, and was optimistic on timelines. But it succeeded where it mattered: it aligned stakeholders, guided implementation, and created a framework for iteration.
The best RFC isn’t the one that predicts everything perfectly. It’s the one that provides enough structure to start building, enough flexibility to adapt, and enough honesty to maintain trust when reality inevitably differs from the plan.
Final Thoughts: The RFC Paradox
A pattern holds across many organizations: the teams that write the best RFCs often need them the least. They have strong communication, clear thinking, and good engineering practices. The RFC is just a formalization of what they already do well.
Conversely, teams that struggle with RFCs often have deeper issues - unclear requirements, competing visions, or technical debt that makes any solution complex. The RFC becomes a forcing function for addressing these issues.
The notification system RFC succeeded not because it was perfect, but because it forced important conversations early. The implementation went smoothly not because the RFC predicted everything, but because it created a shared understanding of what was being built and why.
That’s the real value of a good RFC: it’s not about documenting the perfect solution, it’s about aligning everyone on a good-enough solution and providing a framework for making it better over time.