2025-09-08

The Anatomy of a Good Technical RFC: Section-by-Section Breakdown

A guide to crafting technical RFCs that actually get approved and drive successful implementations, based on reviewing hundreds of documents

Most RFCs, even those written for critical systems, go unread past the first paragraph. What separates a useful document from a bureaucratic exercise is structure, and that starts with a clear problem statement. Reviewing hundreds of RFCs across multiple companies surfaces the same patterns again and again.

The best RFCs aren’t written by the most senior architects. They’re written by engineers who understand that an RFC is fundamentally a sales document: selling a solution to multiple audiences with competing priorities. And like any good sales pitch, structure matters as much as content.

The RFC That Reframes the Problem

Consider a junior engineer’s notification system RFC that gets approved in record time while a senior architect’s more technically sophisticated proposal sits in review for months. The difference: the junior engineer understood the audience and structured the RFC to answer questions in the order stakeholders actually ask them.

Let’s dissect a real notification system RFC section by section, examining why each part works and what reviewers actually look for. This isn’t theoretical - this RFC led to a production system handling millions of notifications daily, and the implementation journey revealed which sections proved most valuable.

Executive Summary: The 30-Second Pitch

The executive summary is your elevator pitch. You have about 30 seconds to convince a busy VP or senior engineer that this document is worth their time. Here’s what works:

What Actually Works

We need to implement a robust, scalable user notification system that can handle 
real-time updates, push notifications, email notifications, and in-app notifications 
across our platform. This system will serve as the backbone for user engagement, 
critical alerts, and feature announcements.

This summary works because it:

  • States the what clearly (notification system)
  • Lists specific capabilities (real-time, push, email, in-app)
  • Connects to business value (user engagement, critical alerts)
  • Avoids technical jargon

Common Mistakes

Weak version:

This RFC proposes implementing a microservices-based event-driven architecture 
utilizing Kafka, PostgreSQL, and WebSockets to facilitate asynchronous message 
delivery across multiple channels with configurable retry mechanisms.

The weak version loses executives at “microservices-based” and never explains why anyone should care. If a system can’t be explained to a product manager in one paragraph, the design probably needs more clarity.

Insider Tips

What reviewers actually look for:

  • Scope clarity: Is this a complete rewrite or an enhancement?
  • Business alignment: Does this solve a real problem or is it resume-driven development?
  • Risk assessment: Are you being honest about complexity?

The notification RFC nailed this by focusing on user impact first, technology second. That clarity helps maintain focus during the inevitable scope creep discussions.

Problem Statement: Quantifying the Pain

The problem statement is where you build urgency. Numbers matter here - vague problems get vague timelines.

Effective Problem Framing

The notification RFC paired each pain point with its business impact:

### Current Pain Points
- Users miss important updates about their projects
- No centralized way to manage notification preferences
- Manual notification sending is error-prone and not scalable

### Business Impact
- Reduced user engagement and retention
- Increased support tickets due to missed communications
- Poor user experience leading to churn

Notice how each pain point maps to measurable business impact. Tracking these exact metrics during implementation shows support tickets dropping by 23% within three months.

Weak Problem Statements

Here’s what doesn’t work:

The current system is outdated and difficult to maintain. Engineers complain 
about the codebase and adding new features is challenging.

This tells me nothing actionable. How outdated? What specific maintenance issues? Which features are blocked? Without specifics, this reads like every legacy system ever.

The Data That Matters

Strong RFCs include:

  • Current metrics: “847 support tickets last month about missed notifications”
  • Cost implications: “Engineers spend 15% of sprint time on manual notification tasks”
  • Opportunity cost: “Three feature launches delayed due to notification limitations”

Reviewing metrics six months post-implementation confirms this: the problems quantified upfront become the success metrics. The RFC essentially writes its own success criteria.

Proposed Solution: Balancing Vision and Specificity

This is where most RFCs go off the rails. Engineers either get lost in implementation details or stay so high-level that nobody knows what’s actually being built.

The Goldilocks Zone

The notification RFC found the perfect balance:

### System Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│  Notification   │    │   Notification   │    │  Notification   │
│     Sources     │───▶│      Engine      │───▶│    Channels     │
└─────────────────┘    └──────────────────┘    └─────────────────┘

### Core Components
- Event Processor: Handles incoming notification events
- Template Engine: Manages notification templates and personalization
- Rate Limiting: Prevents notification spam

This works because it:

  • Shows the big picture architecture visually
  • Breaks down into understandable components
  • Explains what each component does, not how

Over-Engineering Red Flags

Watch out for:

  • Solutions looking for problems (“We’ll use GraphQL subscriptions because they’re modern”)
  • Technology bingo (“Kubernetes, Istio, Envoy, Linkerd…”)
  • Premature optimization (“We’ll shard the database from day one”)

In practice, implementations often start simpler than the RFC suggests. The modular design allows adding complexity gradually: rate limiting in month three, not day one.

Technical Implementation: Where Rubber Meets Road

This section separates the dreamers from the builders. Good technical specs are concrete enough to estimate but flexible enough to adapt.

Database Schema That Survived Production

The RFC’s database schema mostly survived contact with reality:

CREATE TABLE notification_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID REFERENCES users(id) ON DELETE CASCADE,
    notification_type VARCHAR(100) NOT NULL,
    template_id UUID REFERENCES notification_templates(id),
    data JSONB DEFAULT '{}',
    status VARCHAR(20) DEFAULT 'pending',
    sent_at TIMESTAMP,
    delivered_at TIMESTAMP,
    read_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT NOW()
);

What made this effective:

  • Audit trail built-in: sent_at, delivered_at, read_at timestamps
  • Flexibility via JSONB: The data field handled unforeseen requirements
  • Status tracking: Essential for debugging production issues
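
As a hedged illustration of how the status column and the three timestamps can work together, the sketch below models the lifecycle in Python. Only 'pending' appears in the schema; the other status values ('sent', 'delivered', 'read', 'failed'), the transition table, and the advance helper are assumptions made for this example, not part of the RFC.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Assumed lifecycle. Only 'pending' is named in the schema above.
ALLOWED = {
    "pending": {"sent", "failed"},
    "sent": {"delivered", "failed"},
    "delivered": {"read"},
    "read": set(),
    "failed": set(),
}

# Audit column stamped by each status (column names come from the schema).
TIMESTAMP_COLUMN = {"sent": "sent_at", "delivered": "delivered_at", "read": "read_at"}


@dataclass
class NotificationEvent:
    status: str = "pending"
    sent_at: Optional[datetime] = None
    delivered_at: Optional[datetime] = None
    read_at: Optional[datetime] = None


def advance(event: NotificationEvent, new_status: str,
            now: Optional[datetime] = None) -> None:
    """Move the event to new_status and stamp the matching audit column."""
    if new_status not in ALLOWED[event.status]:
        raise ValueError(f"illegal transition {event.status} -> {new_status}")
    event.status = new_status
    column = TIMESTAMP_COLUMN.get(new_status)
    if column is not None:
        setattr(event, column, now or datetime.now(timezone.utc))
```

Rejecting illegal transitions is one way to keep the audit trail trustworthy: a row with read_at set but sent_at empty would otherwise be hard to explain during an incident.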

What Changed in Production

Production reality adds:

  • Index requirements often missed (compound index on user_id + status + created_at)
  • Partition strategy for time-series data (monthly partitions)
  • Archive strategy (moving old notifications to cold storage)

The production debugging post details how these requirements surface in practice.
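
To make the compound-index point concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for PostgreSQL. The index name, the trimmed-down table, and the "latest pending notifications for a user" query are assumptions for illustration; the column order (user_id, status, created_at) follows the list above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Trimmed-down version of the RFC table; SQLite types stand in for UUID/TIMESTAMP.
conn.execute(
    """
    CREATE TABLE notification_events (
        id TEXT PRIMARY KEY,
        user_id TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'pending',
        created_at TEXT NOT NULL
    )
    """
)

# Hypothetical index name. Equality columns first, sort column last.
conn.execute(
    "CREATE INDEX idx_events_user_status_created "
    "ON notification_events (user_id, status, created_at)"
)

# Assumed hot query: a user's most recent notifications in a given status.
query = (
    "SELECT id FROM notification_events "
    "WHERE user_id = ? AND status = ? "
    "ORDER BY created_at DESC LIMIT 20"
)

# The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, ("u1", "pending")).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
print(plan_text)
```

The printed plan should name the index rather than a full table scan. The same CREATE INDEX statement is valid in PostgreSQL, where EXPLAIN would be the tool for confirming that the planner actually uses it.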

API Design That Scales

Good RFCs show representative API endpoints:

POST /api/notifications/send
GET  /api/notifications/user/:userId
PUT  /api/notifications/:id/read

But great RFCs also consider:

  • Pagination strategies for list endpoints
  • Batch operations for efficiency
  • Versioning strategy for future changes
  • Rate limiting at the API level

Cursor-based pagination becomes necessary after offset pagination creates performance issues at scale, something the RFC could have anticipated.
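
A minimal sketch of cursor-based (keyset) pagination, assuming the list endpoint sorts by created_at with id as a tie-breaker. The in-memory ROWS list, the cursor encoding, and the function names are hypothetical; in a real service the list comprehension would be a SQL predicate such as the one shown in the comment.

```python
import base64
import json
from typing import Optional

# Hypothetical rows, already sorted newest first by (created_at, id).
ROWS = [
    {"id": f"n{i:02d}", "created_at": f"2025-01-{i:02d}T00:00:00Z"}
    for i in range(10, 0, -1)
]


def encode_cursor(row: dict) -> str:
    """Pack the sort key of the last row on a page into an opaque token."""
    payload = json.dumps([row["created_at"], row["id"]]).encode()
    return base64.urlsafe_b64encode(payload).decode()


def decode_cursor(cursor: str) -> tuple:
    created_at, row_id = json.loads(base64.urlsafe_b64decode(cursor.encode()))
    return (created_at, row_id)


def list_page(cursor: Optional[str], limit: int = 3) -> dict:
    """Return up to `limit` rows strictly older than the cursor position."""
    rows = ROWS
    if cursor is not None:
        position = decode_cursor(cursor)
        # SQL equivalent: WHERE (created_at, id) < (:created_at, :id)
        rows = [r for r in ROWS if (r["created_at"], r["id"]) < position]
    page = rows[:limit]
    has_more = len(rows) > limit
    return {
        "items": page,
        "next_cursor": encode_cursor(page[-1]) if has_more else None,
    }


first = list_page(None)
second = list_page(first["next_cursor"])
print([r["id"] for r in first["items"]], [r["id"] for r in second["items"]])
```

Unlike OFFSET, the cursor lets the database seek directly to the next row in an index, so the cost of fetching a page does not grow with how deep the client has scrolled.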

Implementation Phases: Realistic Timeline Management

This is where optimism meets reality. Every RFC underestimates timelines, but good ones underestimate less catastrophically.

Phases That Made Sense

### Phase 1: Core Infrastructure (Weeks 1-4)
- Database schema implementation
- Basic notification engine
- In-app notification system

### Phase 2: Advanced Features (Weeks 5-8)
- Push notifications
- Template management system
- Scheduling and rate limiting

Why this phasing worked:

  • Value delivery in phase 1: Users saw notifications within 4 weeks
  • Risk frontloading: Hard problems (real-time delivery) came first
  • Learning incorporation: Phase 2 plans adjusted based on phase 1 lessons

Timeline Reality Check

What actually happened:

  • Phase 1: 6 weeks (not 4) due to authentication integration complexity
  • Phase 2: 10 weeks (not 4) after edge cases in rate limiting surfaced
  • Phase 3: Partially descoped based on actual usage patterns

The implementation series documents how to adapt the timeline while maintaining stakeholder trust.
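
The overruns are easier to discuss with stakeholders as percentages. A small worked calculation, using only the planned and actual week counts for Phases 1 and 2 quoted above (Phase 3 is left out because no durations are given for it):

```python
# Planned vs. actual durations in weeks, taken from the lists above.
phases = {
    "Phase 1": {"planned": 4, "actual": 6},
    "Phase 2": {"planned": 4, "actual": 10},
}


def overrun_pct(planned: int, actual: int) -> float:
    """Overrun as a percentage of the planned duration."""
    return (actual - planned) / planned * 100


for name, p in phases.items():
    print(f"{name}: +{overrun_pct(p['planned'], p['actual']):.0f}%")

total_planned = sum(p["planned"] for p in phases.values())
total_actual = sum(p["actual"] for p in phases.values())
print(
    f"Combined: {total_actual} weeks vs {total_planned} planned "
    f"(+{overrun_pct(total_planned, total_actual):.0f}%)"
)
```

Phase 1 ran 50% over and Phase 2 ran 150% over, so the first two phases together took twice the planned eight weeks.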

Red Flags in Timelines

Watch for:

  • No buffer for discoveries (“Week 1: Implement everything”)
  • No testing time allocated
  • Dependency on other teams not accounted for
  • “Simple integration” with external services (it’s never simple)

Technical Considerations: The Reality Check Section

This section reveals whether the authors have actually built similar systems or are just good at reading blog posts.

Scalability That Matters

The RFC got specific about scale:

### Performance Targets
- Notification delivery: < 100ms for in-app, < 5s for email
- System throughput: 10,000+ notifications per second
- Database query performance: < 50ms for preference lookups

These aren’t arbitrary numbers. They’re derived from:

  • Current user base (10,000 notifications/second = peak load × 3)
  • User experience research (100ms feels instant)
  • Infrastructure constraints (database connection limits)

What Actually Got Measured

Six months in production:

  • P99 in-app delivery: 87ms
  • P99 email delivery: 3.2s
  • Peak throughput: 7,800/second (sufficient)
  • Database query P99: 124ms (needed optimization)

The analytics and optimization post details how these targets get met.
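
One way to keep targets and measurements honest is to compare them mechanically. The sketch below uses the numbers quoted above; the metric keys are hypothetical, and it assumes the measured "database query P99" corresponds to the preference-lookup target.

```python
# Targets from the RFC's performance section, in milliseconds.
TARGETS_MS = {"in_app_delivery": 100, "email_delivery": 5_000, "db_query": 50}

# P99 values measured after six months in production, in milliseconds.
MEASURED_P99_MS = {"in_app_delivery": 87, "email_delivery": 3_200, "db_query": 124}


def check_targets(targets: dict, measured: dict) -> dict:
    """Map each metric to True when the measured value is under its target."""
    return {name: measured[name] < limit for name, limit in targets.items()}


results = check_targets(TARGETS_MS, MEASURED_P99_MS)
for name, ok in results.items():
    verdict = "meets target" if ok else "misses target"
    print(f"{name}: {MEASURED_P99_MS[name]}ms vs < {TARGETS_MS[name]}ms ({verdict})")
```

Run against these figures, the two delivery metrics pass and the database query metric fails, which matches the "needed optimization" note above.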

Security Considerations That Save Your Bacon

Good RFCs address:

  • Authentication: “JWT tokens with 15-minute expiry”
  • Authorization: “Role-based access with granular permissions”
  • Rate limiting: “Per-user limits with exponential backoff”
  • Data privacy: “PII encryption at rest, GDPR compliance”

A security incident can surface months in, when someone attempts to use the notification system for spam. A rate limiting strategy specified in the RFC prevents the platform from becoming an unwitting spam relay.
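
A minimal sketch of per-user rate limiting, assuming a token-bucket policy. The RFC text quoted here does not say which algorithm was used, so the class name, parameters, and in-memory storage are all assumptions; the exponential backoff half of the quoted policy is not covered.

```python
import time
from typing import Callable, Dict, Tuple


class PerUserRateLimiter:
    """Token bucket per user: bursts up to `capacity`, refilled at `refill_per_sec`."""

    def __init__(self, capacity: int, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock  # injectable so tests can control time
        # user_id -> (tokens remaining, time of last check)
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, user_id: str) -> bool:
        """Return True and spend one token if the user is under the limit."""
        now = self.clock()
        tokens, last = self._buckets.get(user_id, (float(self.capacity), now))
        # Refill for the elapsed time, never exceeding the bucket capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_sec)
        if tokens >= 1:
            self._buckets[user_id] = (tokens - 1, now)
            return True
        self._buckets[user_id] = (tokens, now)
        return False


limiter = PerUserRateLimiter(capacity=5, refill_per_sec=1.0)
print([limiter.allow("user-1") for _ in range(6)])
```

Because each user has an independent bucket, one account sending in bulk exhausts only its own allowance. A production version would likely keep the buckets in a shared store rather than process memory.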

Testing Strategy: Beyond “We’ll Write Tests”

Testing sections reveal whether teams actually practice TDD or just talk about it.

Testing That Actually Happened

### Load Tests
- High-volume notification sending
- Concurrent user connections
- Database performance under load
- Queue processing capacity

What made this valuable:

  • Specific scenarios: Not just “load testing” but what specifically
  • Performance criteria: Clear pass/fail conditions
  • Tool selection: We used k6 for load testing, as the RFC suggested

Testing Gaps That Surfaced

The RFC missed:

  • Chaos testing (a Redis failure can surface in month 2)
  • Cross-browser WebSocket compatibility
  • Mobile app battery impact from persistent connections
  • International character set handling in templates

Good RFCs acknowledge that you can’t predict every test scenario but provide a framework for discovering what you missed.

Monitoring & Analytics: What You’ll Actually Look At

Most monitoring sections list every possible metric. Good ones identify the 3-5 metrics that actually indicate system health.

Metrics That Mattered

### Key Metrics
- Delivery success rate (target: 99.9%)
- Delivery time by channel
- User engagement rates
- Support ticket volume

After six months, these four metrics are the ones worth checking daily. Everything else is noise until something breaks.

Alert Fatigue Is Real

The RFC suggested alerting on:

  • High error rates (> 5%)
  • Delivery delays (> 10s)
  • System resource usage (> 80%)

What actually warrants an alert:

  • Delivery success rate < 99% (not 95%)
  • Email delivery P99 > 30s (not 10s)
  • Database connection pool exhaustion (not CPU usage)

The real-time delivery post explains how to tell what actually indicates problems versus normal variance.
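
The tuned thresholds can be written down as explicit rules. This is a hedged sketch: the Snapshot fields and alert names are hypothetical, and "pool exhaustion" is assumed to mean every connection in the pool is in use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    delivery_success_rate: float  # fraction between 0.0 and 1.0
    email_p99_seconds: float
    db_pool_in_use: int
    db_pool_size: int


def alerts(s: Snapshot) -> List[str]:
    """Return the names of the alerts that should fire for this snapshot."""
    fired = []
    if s.delivery_success_rate < 0.99:
        fired.append("delivery_success_rate_below_99pct")
    if s.email_p99_seconds > 30:
        fired.append("email_p99_above_30s")
    if s.db_pool_in_use >= s.db_pool_size:
        fired.append("db_connection_pool_exhausted")
    return fired


# A 12-second email P99 would have paged under the RFC's 10s rule, but not here.
print(alerts(Snapshot(0.995, 12.0, 8, 20)))
print(alerts(Snapshot(0.97, 45.0, 20, 20)))
```

Keeping the rules this small makes it obvious when a threshold changes and why, which is the kind of deviation worth recording back in the RFC.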

Cost Analysis: The Budget Reality

This is where engineering meets business. Good cost sections acknowledge both immediate and ongoing costs.

Costs We Could Predict

### Infrastructure Costs
- Database: $200-500/month
- Message Queue: $50-150/month
- Push Notification Services: $0.50 per 1000 notifications

These were reasonably accurate because they were based on published pricing.

Costs That Got Missed

  • CloudWatch logs: $300/month (we log everything)
  • S3 for notification archive: $150/month
  • Additional database read replica: $400/month
  • Engineering time for maintenance: 0.5 FTE ongoing

The total monthly cost ended up being ~$1,800 versus the $500-800 implied by the RFC. Still worth it, but stakeholders appreciate honesty about TCO.
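
A short worked sum of the itemized figures above shows how much the missed line items add on their own. Usage-based push charges are left out because the post gives no monthly volume, and the 0.5 FTE of maintenance is left out because it is not a dollar figure; the dictionary keys are illustrative names.

```python
# Monthly figures in USD, copied from the two lists above.
predicted_fixed = {"database": (200, 500), "message_queue": (50, 150)}
missed = {"cloudwatch_logs": 300, "s3_archive": 150, "read_replica": 400}

predicted_low = sum(low for low, _ in predicted_fixed.values())
predicted_high = sum(high for _, high in predicted_fixed.values())
missed_total = sum(missed.values())

print(f"Predicted fixed infrastructure: ${predicted_low}-{predicted_high}/month")
print(f"Missed line items: ${missed_total}/month")
print(
    f"Itemized total: ${predicted_low + missed_total}-"
    f"{predicted_high + missed_total}/month"
)
```

The three missed items alone come to $850 a month, more than the upper end of the predicted fixed infrastructure costs.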

ROI That Materialized

The RFC projected:

  • 20-30% reduction in support tickets
  • 5-15% increase in user retention

Actual results:

  • 23% reduction in support tickets
  • 8% increase in user retention
  • Unexpected win: 40% faster feature adoption

Risks & Mitigation: Honest Assessment

The best risk sections admit what the authors don’t know.

Risks That Materialized

Risk: Database performance degradation with high volume
Mitigation: Proper indexing, read replicas, query optimization

This risk absolutely materializes. At 2 million notifications, queries start timing out. The suggested mitigation works, but takes three weeks to implement properly.

Risks That Went Unanticipated

  • WebSocket connection limits in the load balancer
  • Template rendering performance with nested conditionals
  • Time zone edge cases for scheduled notifications
  • Mobile carriers blocking the SMS provider

Good RFCs acknowledge unknown unknowns and build in flexibility to handle them.

Success Criteria: Measurable Outcomes

This section is your contract with stakeholders. Make it measurable and realistic.

Criteria That Worked

### Technical Success
- 99.9% notification delivery success rate
- < 100ms in-app notification delivery
- System handles 10,000+ notifications per second

These were:

  • Measurable: Specific numbers, not “fast” or “reliable”
  • Achievable: Based on similar systems, not wishful thinking
  • Relevant: Tied directly to user experience

Moving Goalposts

What changed after launch:

  • Success rate target moved to 99.5% (99.9% was too expensive)
  • In-app delivery relaxed to 200ms (users couldn’t tell the difference)
  • Throughput requirement dropped to 5,000/second (actual peak load)

The key is documenting why criteria changed and getting stakeholder buy-in on adjustments.

The Reviews That Actually Matter

After reviewing hundreds of RFCs, here’s what different stakeholders actually care about:

What VPs/Directors Look For

  1. Executive summary that explains business value
  2. Cost analysis with clear ROI
  3. Timeline with milestone deliverables
  4. Risk section that doesn’t hide complexity

What Senior Engineers Look For

  1. Technical implementation that shows deep understanding
  2. Scalability considerations based on actual metrics
  3. Alternative approaches and why they were rejected
  4. Integration points with existing systems

What Team Leads Look For

  1. Implementation phases that deliver value iteratively
  2. Testing strategy that’s actually executable
  3. Success criteria their team can rally around
  4. Monitoring approach that won’t create alert fatigue

What Security Teams Look For

  1. Authentication/authorization approach
  2. Data privacy considerations
  3. Rate limiting and abuse prevention
  4. Audit trail capabilities

Lessons From Implementation

Six months after implementing the notification system, these are the areas the RFC should have emphasized more:

Documentation Is Part of the System

The RFC becomes the primary documentation. Structuring it explicitly as living documentation from the start pays dividends.

Migration Strategy Matters

The RFC focused on the new system but barely mentioned migrating from the old one. Migration can take 40% of total effort.

Operational Runbooks Save Lives

The RFC should include or mandate operational runbooks. The gap becomes obvious after the first production incident.

Feature Flags Are Your Friend

The RFC mentioned phased rollout but didn’t emphasize feature flags. When issues emerge, flags let you switch off a misbehaving feature without a full rollback.
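
A minimal sketch of a feature flag with a kill switch and a percentage rollout. The Flag class, the hashing scheme, and the flag name are assumptions for illustration, not the mechanism the team used.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Flag:
    name: str
    enabled: bool = True      # kill switch: False turns the feature off for everyone
    rollout_percent: int = 0  # share of users who get the feature, 0-100


def bucket(flag_name: str, user_id: str) -> int:
    """Stable bucket in [0, 100) so a user gets the same decision on every request."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: Flag, user_id: str) -> bool:
    if not flag.enabled:
        return False
    return bucket(flag.name, user_id) < flag.rollout_percent


push = Flag(name="push_notifications", rollout_percent=10)
print(sum(is_enabled(push, f"user-{i}") for i in range(1000)), "of 1000 users enabled")

# Flipping the kill switch disables the feature without a deploy.
push.enabled = False
print(any(is_enabled(push, f"user-{i}") for i in range(1000)))
```

Hashing the flag name together with the user id keeps each user's decision stable as the percentage grows, so anyone enabled at 10% stays enabled at 50%.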

The RFC as a Living Document

The best RFCs aren’t abandoned after approval. They evolve into:

  • Architecture documentation
  • Onboarding materials for new team members
  • Decision logs for future reference
  • Post-mortem context when things go wrong

A well-maintained notification system RFC can accumulate dozens of commits post-approval, documenting every significant deviation and learning.

What Makes RFCs Actually Useful

Here’s what separates useful RFCs from bureaucratic exercises:

Write for Multiple Audiences

Your RFC has at least four audiences: executives, architects, implementers, and operators. Structure it so each can find what they need quickly.

Be Honest About Uncertainty

The best RFCs include sections titled “What We Don’t Know Yet” or “Assumptions That Might Be Wrong.”

Include Escape Hatches

Good RFCs explain not just how to build the system but how to back out if things go wrong. This paradoxically makes approval easier.

Make Success Measurable

Vague success criteria lead to endless debates. Specific numbers force clarity about what you’re actually trying to achieve.

Show Your Work

Include enough detail that another team could implement your design, but not so much that you’re writing the code in prose.

The Perfect RFC Doesn’t Exist

No RFC is perfect. The notification system RFC dissected in this post had significant gaps: it underestimated complexity, missed operational concerns, and was optimistic on timelines. But it succeeded where it mattered: it aligned stakeholders, guided implementation, and created a framework for iteration.

The best RFC isn’t the one that predicts everything perfectly. It’s the one that provides enough structure to start building, enough flexibility to adapt, and enough honesty to maintain trust when reality inevitably differs from the plan.

Final Thoughts: The RFC Paradox

A pattern holds across many organizations: the teams that write the best RFCs often need them the least. They have strong communication, clear thinking, and good engineering practices. The RFC is just a formalization of what they already do well.

Conversely, teams that struggle with RFCs often have deeper issues - unclear requirements, competing visions, or technical debt that makes any solution complex. The RFC becomes a forcing function for addressing these issues.

The notification system RFC succeeded not because it was perfect, but because it forced important conversations early. The implementation went smoothly not because the RFC predicted everything, but because it created a shared understanding of what was being built and why.

That’s the real value of a good RFC: it’s not about documenting the perfect solution, it’s about aligning everyone on a good-enough solution and providing a framework for making it better over time.
