2025-09-08
Building a Scalable User Notification System: Architecture and Database Design
Design patterns, database schemas, and architectural decisions for building enterprise notification systems that handle millions of users
A notification feature starts as “send an email when X happens” and turns into a multi-channel delivery problem within a quarter: email, SMS, push, and in-app, each with its own delivery guarantees, retry semantics, and user-preference surface. The architectural mistake is treating this as a template-plus-send problem; the actual problem is a router that has to decide (per user, per channel, per event) whether to fan out, coalesce, suppress, or defer, and has to keep that decision auditable for compliance and support.
This post is part 1 of a series on building a production notification system. It covers the event-driven architecture (producers, router, dispatcher), the database schema for events, preferences, and delivery state, the channel router pattern, and the observability surface that makes failures debuggable before they reach the user.
The Hidden Complexity of “Simple” Notifications
The initial mental model for notifications is: trigger event → send message → done. The production reality is a complex orchestration of user preferences, delivery channels, rate limiting, retry logic, template management, analytics tracking, and regulatory compliance.
The complexity becomes apparent during a first major product launch. With 10,000 users suddenly receiving welcome emails, password resets, and activity notifications simultaneously, the email service starts throttling, the database connection pool maxes out, and users start complaining about duplicate notifications.
System Architecture
The architecture below has been validated across different scales and industries. Every component exists because something broke in production without it.
Event-Driven Architecture
Notifications are not request-response operations. They are fire-and-forget events that require asynchronous processing. The following event structure works reliably across multiple systems:
interface NotificationEvent {
id: string;
userId: string;
type: NotificationType;
templateId?: string;
data: Record<string, any>;
priority: 'low' | 'normal' | 'high' | 'critical';
scheduledAt?: Date;
expiresAt?: Date;
metadata: {
source: string;
correlationId: string;
retryCount: number;
maxRetries: number;
};
}
enum NotificationType {
PROJECT_UPDATE = 'project_update',
SECURITY_ALERT = 'security_alert',
FEATURE_ANNOUNCEMENT = 'feature_announcement',
SYSTEM_MAINTENANCE = 'system_maintenance',
USER_ACTIVITY = 'user_activity',
INTEGRATION_UPDATE = 'integration_update'
}
The metadata section is crucial: the correlation ID enables tracing notification flows across distributed systems and is essential for debugging delivery failures.
The Notification Engine: Heart of the System
The notification engine is where most of the complexity lives. The following implementation reflects several iterations of refinement:
class NotificationEngine {
constructor(
private eventBus: EventBus,
private templateService: TemplateService,
private preferenceManager: PreferenceManager,
private rateLimiter: RateLimiter,
private channelRouter: ChannelRouter,
private analytics: AnalyticsService
) {}
async processEvent(event: NotificationEvent): Promise<void> {
try {
// Check if user exists and is active
const user = await this.getUserWithPreferences(event.userId);
if (!user?.isActive) {
await this.analytics.trackSkipped(event.id, 'user_inactive');
return;
}
// Apply user preferences filtering
const enabledChannels = await this.preferenceManager
.getEnabledChannels(event.userId, event.type);
if (enabledChannels.length === 0) {
await this.analytics.trackSkipped(event.id, 'all_channels_disabled');
return;
}
// Rate limiting check
const rateLimitResult = await this.rateLimiter
.checkLimits(event.userId, event.type);
if (!rateLimitResult.allowed) {
await this.scheduleRetry(event, rateLimitResult.retryAfter);
return;
}
// Process each enabled channel
const deliveryPromises = enabledChannels.map(channel =>
this.processChannel(event, channel, user)
);
const results = await Promise.allSettled(deliveryPromises);
await this.analytics.trackDeliveryResults(event.id, results);
} catch (error) {
await this.handleProcessingError(event, error);
}
}
private async processChannel(
event: NotificationEvent,
channel: NotificationChannel,
user: User
): Promise<DeliveryResult> {
// Template rendering with user data
const template = await this.templateService.getTemplate(
event.type,
channel,
user.locale
);
const renderedContent = await this.templateService.render(
template,
{ ...event.data, user }
);
// Route to appropriate channel handler
return await this.channelRouter.deliver(
channel,
user,
renderedContent,
event.metadata
);
}
}
The key insight here: every operation can fail, and you need to handle failures gracefully while maintaining visibility into what’s happening.
Database Design: The Foundation That Makes or Breaks You
The schema below has been refined across multiple iterations and has held up under production-scale load:
Core Tables
-- Users table (assuming it exists)
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
email VARCHAR(255) UNIQUE NOT NULL,
phone VARCHAR(20),
locale VARCHAR(10) DEFAULT 'en',
timezone VARCHAR(50) DEFAULT 'UTC',
is_active BOOLEAN DEFAULT true,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
-- The preference system - this gets complex fast
CREATE TABLE notification_preferences (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID REFERENCES users(id) ON DELETE CASCADE,
notification_type VARCHAR(100) NOT NULL,
channel VARCHAR(50) NOT NULL,
enabled BOOLEAN DEFAULT true,
frequency VARCHAR(20) DEFAULT 'immediate', -- immediate, daily, weekly
quiet_hours_start TIME DEFAULT '22:00:00',
quiet_hours_end TIME DEFAULT '08:00:00',
metadata JSONB DEFAULT '{}', -- for channel-specific settings
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
UNIQUE(user_id, notification_type, channel)
);
-- Template management - localization is crucial
CREATE TABLE notification_templates (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name VARCHAR(255) NOT NULL,
notification_type VARCHAR(100) NOT NULL,
channel VARCHAR(50) NOT NULL,
locale VARCHAR(10) DEFAULT 'en',
subject VARCHAR(500),
body TEXT NOT NULL,
variables JSONB DEFAULT '{}', -- expected variables
is_active BOOLEAN DEFAULT true,
version INTEGER DEFAULT 1,
created_by UUID REFERENCES users(id),
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
UNIQUE(notification_type, channel, locale, version)
);
Event Storage and Tracking
Event storage is the area most prone to scaling surprises. The schema below addresses the most common failure modes:
-- Main event table - this gets HUGE
CREATE TABLE notification_events (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID REFERENCES users(id) ON DELETE CASCADE,
notification_type VARCHAR(100) NOT NULL,
template_id UUID REFERENCES notification_templates(id),
priority VARCHAR(20) DEFAULT 'normal',
data JSONB DEFAULT '{}',
scheduled_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
expires_at TIMESTAMP WITH TIME ZONE,
status VARCHAR(20) DEFAULT 'pending',
processed_at TIMESTAMP WITH TIME ZONE,
correlation_id VARCHAR(255), -- for tracing
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
INDEX idx_notification_events_user_created (user_id, created_at DESC),
INDEX idx_notification_events_status (status, scheduled_at),
INDEX idx_notification_events_correlation (correlation_id)
);
-- Delivery tracking - separate for performance
CREATE TABLE notification_deliveries (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
event_id UUID REFERENCES notification_events(id) ON DELETE CASCADE,
channel VARCHAR(50) NOT NULL,
status VARCHAR(20) DEFAULT 'pending', -- pending, sent, delivered, failed, bounced
attempt_count INTEGER DEFAULT 0,
max_attempts INTEGER DEFAULT 3,
next_retry_at TIMESTAMP WITH TIME ZONE,
sent_at TIMESTAMP WITH TIME ZONE,
delivered_at TIMESTAMP WITH TIME ZONE,
failed_at TIMESTAMP WITH TIME ZONE,
error_code VARCHAR(50),
error_message TEXT,
provider_id VARCHAR(255), -- external provider message ID
provider_response JSONB,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
INDEX idx_deliveries_event_channel (event_id, channel),
INDEX idx_deliveries_retry (status, next_retry_at) WHERE status = 'pending'
);
-- Analytics aggregation table - learned this the hard way
CREATE TABLE notification_metrics (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
date DATE NOT NULL,
hour SMALLINT NOT NULL, -- 0-23
notification_type VARCHAR(100) NOT NULL,
channel VARCHAR(50) NOT NULL,
status VARCHAR(20) NOT NULL,
count INTEGER DEFAULT 1,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
UNIQUE(date, hour, notification_type, channel, status)
);
Performance Lessons from Production
Here are the indexing strategies that actually matter when you’re processing millions of notifications:
-- Critical indexes based on query patterns
CREATE INDEX idx_events_user_type_created
ON notification_events(user_id, notification_type, created_at DESC);
CREATE INDEX idx_events_processing_queue
ON notification_events(status, scheduled_at)
WHERE status IN ('pending', 'retry');
CREATE INDEX idx_deliveries_retry_queue
ON notification_deliveries(next_retry_at, status)
WHERE status = 'pending' AND next_retry_at IS NOT NULL;
-- Partial indexes for common queries
CREATE INDEX idx_events_recent_active
ON notification_events(created_at DESC)
WHERE created_at > NOW() - INTERVAL '7 days';
-- For analytics queries
CREATE INDEX idx_metrics_time_type
ON notification_metrics(date, hour, notification_type);
The partial indexes are crucial. Without them, your analytics queries will start timing out when you hit millions of events.
User Preferences: More Complex Than You Think
User preferences seem straightforward until you hit edge cases. Here’s the preference manager that’s handled real-world complexity:
class PreferenceManager {
async getEnabledChannels(
userId: string,
notificationType: string
): Promise<NotificationChannel[]> {
// Check global user preferences
const userPrefs = await this.db.query(`
SELECT np.channel, np.enabled, np.frequency,
np.quiet_hours_start, np.quiet_hours_end,
u.timezone
FROM notification_preferences np
JOIN users u ON u.id = np.user_id
WHERE np.user_id = $1 AND np.notification_type = $2
`, [userId, notificationType]);
if (userPrefs.length === 0) {
// Use default preferences for this notification type
return this.getDefaultChannels(notificationType);
}
const currentTime = new Date();
const enabledChannels: NotificationChannel[] = [];
for (const pref of userPrefs) {
if (!pref.enabled) continue;
// Check quiet hours
if (this.isInQuietHours(currentTime, pref)) {
// Check if this is a critical notification that overrides quiet hours
if (!this.isCriticalNotification(notificationType)) {
continue;
}
}
// Check frequency preferences
if (!this.shouldSendBasedOnFrequency(userId, pref.frequency, notificationType)) {
continue;
}
enabledChannels.push(pref.channel as NotificationChannel);
}
return enabledChannels;
}
private isInQuietHours(currentTime: Date, pref: any): boolean {
// Convert current time to user's timezone
const userTime = moment(currentTime)
.tz(pref.timezone || 'UTC')
.format('HH:mm:ss');
const quietStart = pref.quiet_hours_start;
const quietEnd = pref.quiet_hours_end;
// Handle quiet hours that cross midnight
if (quietStart > quietEnd) {
return userTime >= quietStart || userTime <= quietEnd;
}
return userTime >= quietStart && userTime <= quietEnd;
}
}
Timezone handling alone requires multiple iterations to get right. User preferences become significantly more complex in a global application than they first appear.
Template System: Localization and Personalization
Templates are where the rubber meets the road for user experience. Here’s the template service that handles localization, personalization, and A/B testing:
interface Template {
id: string;
name: string;
type: string;
channel: string;
locale: string;
subject?: string;
body: string;
variables: Record<string, TemplateVariable>;
abTest?: ABTestConfig;
}
class TemplateService {
async getTemplate(
notificationType: string,
channel: NotificationChannel,
locale: string = 'en'
): Promise<Template> {
// Try to get localized template first
let template = await this.db.findTemplate({
type: notificationType,
channel,
locale,
isActive: true
});
// Fallback to English if no localized version
if (!template && locale !== 'en') {
template = await this.db.findTemplate({
type: notificationType,
channel,
locale: 'en',
isActive: true
});
}
if (!template) {
throw new Error(`No template found for ${notificationType}/${channel}/${locale}`);
}
return template;
}
async render(template: Template, data: Record<string, any>): Promise<RenderedContent> {
try {
// Validate required variables
await this.validateTemplateData(template, data);
// Process template with Handlebars or similar
const subject = template.subject
? await this.renderString(template.subject, data)
: undefined;
const body = await this.renderString(template.body, data);
return {
subject,
body,
templateId: template.id,
locale: template.locale
};
} catch (error) {
// Log template rendering errors for debugging
await this.logger.error('Template rendering failed', {
templateId: template.id,
error: error.message,
data: this.sanitizeDataForLogging(data)
});
throw new TemplateRenderError(`Failed to render template ${template.id}`, error);
}
}
}
Rate Limiting: Protecting Users and Providers
Rate limiting balances user experience with system stability. The following implementation covers per-user, per-type limits with atomic Redis checks:
interface RateLimitConfig {
notificationType: string;
channel: string;
limits: {
perMinute: number;
perHour: number;
perDay: number;
};
burstAllowance: number;
}
class RateLimiter {
constructor(private redis: Redis, private configs: RateLimitConfig[]) {}
async checkLimits(
userId: string,
notificationType: string
): Promise<RateLimitResult> {
const config = this.getConfig(notificationType);
if (!config) {
return { allowed: true, remainingToday: Infinity };
}
const now = Date.now();
const keys = {
minute: `rate_limit:${userId}:${notificationType}:${Math.floor(now / 60000)}`,
hour: `rate_limit:${userId}:${notificationType}:${Math.floor(now / 3600000)}`,
day: `rate_limit:${userId}:${notificationType}:${Math.floor(now / 86400000)}`
};
// Use Redis pipeline for atomic checks
const pipeline = this.redis.pipeline();
pipeline.incr(keys.minute);
pipeline.expire(keys.minute, 60);
pipeline.incr(keys.hour);
pipeline.expire(keys.hour, 3600);
pipeline.incr(keys.day);
pipeline.expire(keys.day, 86400);
const results = await pipeline.exec();
const counts = {
minute: results[0][1] as number,
hour: results[2][1] as number,
day: results[4][1] as number
};
// Check against limits
if (counts.minute > config.limits.perMinute ||
counts.hour > config.limits.perHour ||
counts.day > config.limits.perDay) {
return {
allowed: false,
retryAfter: this.calculateRetryAfter(counts, config),
remainingToday: Math.max(0, config.limits.perDay - counts.day)
};
}
return {
allowed: true,
remainingToday: config.limits.perDay - counts.day
};
}
}
Key Lessons for Production Systems
Systems handling millions of messages daily surface the same lessons repeatedly:
-
Start with idempotency: Every notification operation should be idempotent. Users notice duplicates more acutely than missing notifications.
-
Design for observability: Debugging delivery issues consumes more time than building features. Correlation IDs and detailed logging are not optional.
-
Separate concerns early: A notification engine that grows into a monolith becomes hard to scale. Each channel should be independently deployable.
-
Plan for data retention: Notification data grows quickly. A retention and archiving strategy is needed from day one.
-
User preferences are complex: A simple on/off toggle becomes timezone-aware, frequency-based, channel-specific preferences with quiet hours and emergency overrides.
The next part of this series covers real-time delivery mechanisms: WebSocket connections, push notifications, and the channel-specific implementations. It also covers why retry logic and circuit breakers are required, not optional, in a production notification service.
References
- What is Amazon SNS? - AWS Documentation - Overview of Amazon Simple Notification Service as the pub/sub backbone for multi-channel notification fanout architecture
- Common Amazon SNS scenarios - AWS Documentation - Fanout, mobile push, and application-to-application messaging patterns that underpin scalable notification architectures
- Using dead-letter queues in Amazon SQS - AWS Documentation - SQS dead-letter queue configuration for isolating and replaying failed notification deliveries
- Firebase Cloud Messaging - firebase.google.com - Official FCM documentation covering multi-platform push notification delivery for Android, iOS, and web
- Sending notification requests to APNs - Apple Developer - Apple’s official guide for transmitting push notification payloads to iOS and macOS devices via APNs
The foundation outlined here may seem over-engineered for a simple notification system. When debugging why 50,000 users did not receive their password reset emails during a product launch, every piece of observability and resilience built into this design proves its value.
Building a Scalable User Notification System
A comprehensive 4-part series covering the design, implementation, and production challenges of building enterprise-grade notification systems. From architecture and database design to real-time delivery, debugging at scale, and performance optimization.
All posts in this series
Related posts
Comprehensive guide to choosing the right database for your project - covering SQL, NoSQL, NewSQL, and edge solutions with real-world implementation stories and performance benchmarks.
What Aurora Serverless v2 actually is under the hood: the shared Aurora storage layer, the ACU-driven compute layer, the Caspian heat-management substrate, scale-to-zero mechanics, and mixed-mode clusters.
A deep dive into building SaaS authorization with AWS Cognito and Verified Permissions. Covers Cedar policy language, multi-tenant patterns, JWT token flow, cost analysis, and common mistakes with TypeScript examples.
A vendor-neutral evaluation of external authorization platforms including AWS Verified Permissions, SpiceDB, OpenFGA, Cerbos, and OPA. Covers architecture patterns, cost analysis, and a decision framework for engineering teams.
A deep technical comparison of Cedar, Rego, OpenFGA DSL, and Cerbos YAML/CEL policy languages. Covers syntax, performance benchmarks, formal verification, tooling, and integration patterns with TypeScript examples for each language.