2025-09-08

Building a Scalable User Notification System: Architecture and Database Design

Design patterns, database schemas, and architectural decisions for building enterprise notification systems that handle millions of users

A notification feature starts as “send an email when X happens” and turns into a multi-channel delivery problem within a quarter: email, SMS, push, and in-app, each with its own delivery guarantees, retry semantics, and user-preference surface. The architectural mistake is treating this as a template-plus-send problem; the actual problem is a router that has to decide (per user, per channel, per event) whether to fan out, coalesce, suppress, or defer, and has to keep that decision auditable for compliance and support.

This post is part 1 of a series on building a production notification system. It covers the event-driven architecture (producers, router, dispatcher), the database schema for events, preferences, and delivery state, the channel router pattern, and the observability surface that makes failures debuggable before they reach the user.

The Hidden Complexity of “Simple” Notifications

The initial mental model for notifications is: trigger event → send message → done. The production reality is a complex orchestration of user preferences, delivery channels, rate limiting, retry logic, template management, analytics tracking, and regulatory compliance.

The complexity becomes apparent during a first major product launch. With 10,000 users suddenly receiving welcome emails, password resets, and activity notifications simultaneously, the email service starts throttling, the database connection pool maxes out, and users start complaining about duplicate notifications.

System Architecture

The architecture below has been validated across different scales and industries. Every component exists because something broke in production without it.

Event-Driven Architecture

Notifications are not request-response operations. They are fire-and-forget events that require asynchronous processing. The following event structure works reliably across multiple systems:

interface NotificationEvent {
  id: string;
  userId: string;
  type: NotificationType;
  templateId?: string;
  data: Record<string, any>;
  priority: 'low' | 'normal' | 'high' | 'critical';
  scheduledAt?: Date;
  expiresAt?: Date;
  metadata: {
    source: string;
    correlationId: string;
    retryCount: number;
    maxRetries: number;
  };
}

enum NotificationType {
  PROJECT_UPDATE = 'project_update',
  SECURITY_ALERT = 'security_alert', 
  FEATURE_ANNOUNCEMENT = 'feature_announcement',
  SYSTEM_MAINTENANCE = 'system_maintenance',
  USER_ACTIVITY = 'user_activity',
  INTEGRATION_UPDATE = 'integration_update'
}

The metadata section is crucial: the correlation ID enables tracing notification flows across distributed systems and is essential for debugging delivery failures.

The Notification Engine: Heart of the System

The notification engine is where most of the complexity lives. The following implementation reflects several iterations of refinement:

class NotificationEngine {
  constructor(
    private eventBus: EventBus,
    private templateService: TemplateService,
    private preferenceManager: PreferenceManager,
    private rateLimiter: RateLimiter,
    private channelRouter: ChannelRouter,
    private analytics: AnalyticsService
  ) {}

  async processEvent(event: NotificationEvent): Promise<void> {
    try {
      // Check if user exists and is active
      const user = await this.getUserWithPreferences(event.userId);
      if (!user?.isActive) {
        await this.analytics.trackSkipped(event.id, 'user_inactive');
        return;
      }

      // Apply user preferences filtering
      const enabledChannels = await this.preferenceManager
        .getEnabledChannels(event.userId, event.type);
      
      if (enabledChannels.length === 0) {
        await this.analytics.trackSkipped(event.id, 'all_channels_disabled');
        return;
      }

      // Rate limiting check
      const rateLimitResult = await this.rateLimiter
        .checkLimits(event.userId, event.type);
      
      if (!rateLimitResult.allowed) {
        await this.scheduleRetry(event, rateLimitResult.retryAfter);
        return;
      }

      // Process each enabled channel
      const deliveryPromises = enabledChannels.map(channel => 
        this.processChannel(event, channel, user)
      );

      const results = await Promise.allSettled(deliveryPromises);
      await this.analytics.trackDeliveryResults(event.id, results);

    } catch (error) {
      await this.handleProcessingError(event, error);
    }
  }

  private async processChannel(
    event: NotificationEvent,
    channel: NotificationChannel,
    user: User
  ): Promise<DeliveryResult> {
    // Template rendering with user data
    const template = await this.templateService.getTemplate(
      event.type,
      channel,
      user.locale
    );

    const renderedContent = await this.templateService.render(
      template,
      { ...event.data, user }
    );

    // Route to appropriate channel handler
    return await this.channelRouter.deliver(
      channel,
      user,
      renderedContent,
      event.metadata
    );
  }
}

The key insight here: every operation can fail, and you need to handle failures gracefully while maintaining visibility into what’s happening.

Database Design: The Foundation That Makes or Breaks You

The schema below has been refined across multiple iterations and has held up under production-scale load:

Core Tables

-- Users table (assuming it exists)
CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email VARCHAR(255) UNIQUE NOT NULL,
    phone VARCHAR(20),
    locale VARCHAR(10) DEFAULT 'en',
    timezone VARCHAR(50) DEFAULT 'UTC',
    is_active BOOLEAN DEFAULT true,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

-- The preference system - this gets complex fast
CREATE TABLE notification_preferences (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID REFERENCES users(id) ON DELETE CASCADE,
    notification_type VARCHAR(100) NOT NULL,
    channel VARCHAR(50) NOT NULL,
    enabled BOOLEAN DEFAULT true,
    frequency VARCHAR(20) DEFAULT 'immediate', -- immediate, daily, weekly
    quiet_hours_start TIME DEFAULT '22:00:00',
    quiet_hours_end TIME DEFAULT '08:00:00',
    metadata JSONB DEFAULT '{}', -- for channel-specific settings
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    UNIQUE(user_id, notification_type, channel)
);

-- Template management - localization is crucial
CREATE TABLE notification_templates (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name VARCHAR(255) NOT NULL,
    notification_type VARCHAR(100) NOT NULL,
    channel VARCHAR(50) NOT NULL,
    locale VARCHAR(10) DEFAULT 'en',
    subject VARCHAR(500),
    body TEXT NOT NULL,
    variables JSONB DEFAULT '{}', -- expected variables
    is_active BOOLEAN DEFAULT true,
    version INTEGER DEFAULT 1,
    created_by UUID REFERENCES users(id),
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    UNIQUE(notification_type, channel, locale, version)
);

Event Storage and Tracking

Event storage is the area most prone to scaling surprises. The schema below addresses the most common failure modes:

-- Main event table - this gets HUGE
CREATE TABLE notification_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID REFERENCES users(id) ON DELETE CASCADE,
    notification_type VARCHAR(100) NOT NULL,
    template_id UUID REFERENCES notification_templates(id),
    priority VARCHAR(20) DEFAULT 'normal',
    data JSONB DEFAULT '{}',
    scheduled_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    expires_at TIMESTAMP WITH TIME ZONE,
    status VARCHAR(20) DEFAULT 'pending',
    processed_at TIMESTAMP WITH TIME ZONE,
    correlation_id VARCHAR(255), -- for tracing
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    INDEX idx_notification_events_user_created (user_id, created_at DESC),
    INDEX idx_notification_events_status (status, scheduled_at),
    INDEX idx_notification_events_correlation (correlation_id)
);

-- Delivery tracking - separate for performance
CREATE TABLE notification_deliveries (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    event_id UUID REFERENCES notification_events(id) ON DELETE CASCADE,
    channel VARCHAR(50) NOT NULL,
    status VARCHAR(20) DEFAULT 'pending', -- pending, sent, delivered, failed, bounced
    attempt_count INTEGER DEFAULT 0,
    max_attempts INTEGER DEFAULT 3,
    next_retry_at TIMESTAMP WITH TIME ZONE,
    sent_at TIMESTAMP WITH TIME ZONE,
    delivered_at TIMESTAMP WITH TIME ZONE,
    failed_at TIMESTAMP WITH TIME ZONE,
    error_code VARCHAR(50),
    error_message TEXT,
    provider_id VARCHAR(255), -- external provider message ID
    provider_response JSONB,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    INDEX idx_deliveries_event_channel (event_id, channel),
    INDEX idx_deliveries_retry (status, next_retry_at) WHERE status = 'pending'
);

-- Analytics aggregation table - learned this the hard way
CREATE TABLE notification_metrics (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    date DATE NOT NULL,
    hour SMALLINT NOT NULL, -- 0-23
    notification_type VARCHAR(100) NOT NULL,
    channel VARCHAR(50) NOT NULL,
    status VARCHAR(20) NOT NULL,
    count INTEGER DEFAULT 1,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    UNIQUE(date, hour, notification_type, channel, status)
);

Performance Lessons from Production

Here are the indexing strategies that actually matter when you’re processing millions of notifications:

-- Critical indexes based on query patterns
CREATE INDEX idx_events_user_type_created 
ON notification_events(user_id, notification_type, created_at DESC);

CREATE INDEX idx_events_processing_queue 
ON notification_events(status, scheduled_at) 
WHERE status IN ('pending', 'retry');

CREATE INDEX idx_deliveries_retry_queue 
ON notification_deliveries(next_retry_at, status) 
WHERE status = 'pending' AND next_retry_at IS NOT NULL;

-- Partial indexes for common queries
CREATE INDEX idx_events_recent_active 
ON notification_events(created_at DESC) 
WHERE created_at > NOW() - INTERVAL '7 days';

-- For analytics queries
CREATE INDEX idx_metrics_time_type 
ON notification_metrics(date, hour, notification_type);

The partial indexes are crucial. Without them, your analytics queries will start timing out when you hit millions of events.

User Preferences: More Complex Than You Think

User preferences seem straightforward until you hit edge cases. Here’s the preference manager that’s handled real-world complexity:

class PreferenceManager {
  async getEnabledChannels(
    userId: string, 
    notificationType: string
  ): Promise<NotificationChannel[]> {
    
    // Check global user preferences
    const userPrefs = await this.db.query(`
      SELECT np.channel, np.enabled, np.frequency, 
             np.quiet_hours_start, np.quiet_hours_end,
             u.timezone
      FROM notification_preferences np
      JOIN users u ON u.id = np.user_id
      WHERE np.user_id = $1 AND np.notification_type = $2
    `, [userId, notificationType]);

    if (userPrefs.length === 0) {
      // Use default preferences for this notification type
      return this.getDefaultChannels(notificationType);
    }

    const currentTime = new Date();
    const enabledChannels: NotificationChannel[] = [];

    for (const pref of userPrefs) {
      if (!pref.enabled) continue;

      // Check quiet hours
      if (this.isInQuietHours(currentTime, pref)) {
        // Check if this is a critical notification that overrides quiet hours
        if (!this.isCriticalNotification(notificationType)) {
          continue;
        }
      }

      // Check frequency preferences
      if (!this.shouldSendBasedOnFrequency(userId, pref.frequency, notificationType)) {
        continue;
      }

      enabledChannels.push(pref.channel as NotificationChannel);
    }

    return enabledChannels;
  }

  private isInQuietHours(currentTime: Date, pref: any): boolean {
    // Convert current time to user's timezone
    const userTime = moment(currentTime)
      .tz(pref.timezone || 'UTC')
      .format('HH:mm:ss');

    const quietStart = pref.quiet_hours_start;
    const quietEnd = pref.quiet_hours_end;

    // Handle quiet hours that cross midnight
    if (quietStart > quietEnd) {
      return userTime >= quietStart || userTime <= quietEnd;
    }

    return userTime >= quietStart && userTime <= quietEnd;
  }
}

Timezone handling alone requires multiple iterations to get right. User preferences become significantly more complex in a global application than they first appear.

Template System: Localization and Personalization

Templates are where the rubber meets the road for user experience. Here’s the template service that handles localization, personalization, and A/B testing:

interface Template {
  id: string;
  name: string;
  type: string;
  channel: string;
  locale: string;
  subject?: string;
  body: string;
  variables: Record<string, TemplateVariable>;
  abTest?: ABTestConfig;
}

class TemplateService {
  async getTemplate(
    notificationType: string,
    channel: NotificationChannel,
    locale: string = 'en'
  ): Promise<Template> {
    
    // Try to get localized template first
    let template = await this.db.findTemplate({
      type: notificationType,
      channel,
      locale,
      isActive: true
    });

    // Fallback to English if no localized version
    if (!template && locale !== 'en') {
      template = await this.db.findTemplate({
        type: notificationType,
        channel,
        locale: 'en',
        isActive: true
      });
    }

    if (!template) {
      throw new Error(`No template found for ${notificationType}/${channel}/${locale}`);
    }

    return template;
  }

  async render(template: Template, data: Record<string, any>): Promise<RenderedContent> {
    try {
      // Validate required variables
      await this.validateTemplateData(template, data);

      // Process template with Handlebars or similar
      const subject = template.subject 
        ? await this.renderString(template.subject, data)
        : undefined;

      const body = await this.renderString(template.body, data);

      return {
        subject,
        body,
        templateId: template.id,
        locale: template.locale
      };

    } catch (error) {
      // Log template rendering errors for debugging
      await this.logger.error('Template rendering failed', {
        templateId: template.id,
        error: error.message,
        data: this.sanitizeDataForLogging(data)
      });
      
      throw new TemplateRenderError(`Failed to render template ${template.id}`, error);
    }
  }
}

Rate Limiting: Protecting Users and Providers

Rate limiting balances user experience with system stability. The following implementation covers per-user, per-type limits with atomic Redis checks:

interface RateLimitConfig {
  notificationType: string;
  channel: string;
  limits: {
    perMinute: number;
    perHour: number;
    perDay: number;
  };
  burstAllowance: number;
}

class RateLimiter {
  constructor(private redis: Redis, private configs: RateLimitConfig[]) {}

  async checkLimits(
    userId: string, 
    notificationType: string
  ): Promise<RateLimitResult> {
    
    const config = this.getConfig(notificationType);
    if (!config) {
      return { allowed: true, remainingToday: Infinity };
    }

    const now = Date.now();
    const keys = {
      minute: `rate_limit:${userId}:${notificationType}:${Math.floor(now / 60000)}`,
      hour: `rate_limit:${userId}:${notificationType}:${Math.floor(now / 3600000)}`,
      day: `rate_limit:${userId}:${notificationType}:${Math.floor(now / 86400000)}`
    };

    // Use Redis pipeline for atomic checks
    const pipeline = this.redis.pipeline();
    pipeline.incr(keys.minute);
    pipeline.expire(keys.minute, 60);
    pipeline.incr(keys.hour);
    pipeline.expire(keys.hour, 3600);
    pipeline.incr(keys.day);
    pipeline.expire(keys.day, 86400);

    const results = await pipeline.exec();
    const counts = {
      minute: results[0][1] as number,
      hour: results[2][1] as number,
      day: results[4][1] as number
    };

    // Check against limits
    if (counts.minute > config.limits.perMinute ||
        counts.hour > config.limits.perHour ||
        counts.day > config.limits.perDay) {
      
      return {
        allowed: false,
        retryAfter: this.calculateRetryAfter(counts, config),
        remainingToday: Math.max(0, config.limits.perDay - counts.day)
      };
    }

    return {
      allowed: true,
      remainingToday: config.limits.perDay - counts.day
    };
  }
}

Key Lessons for Production Systems

Systems handling millions of messages daily surface the same lessons repeatedly:

Start with idempotency: Every notification operation should be idempotent. Users notice duplicates more acutely than missing notifications.
Design for observability: Debugging delivery issues consumes more time than building features. Correlation IDs and detailed logging are not optional.
Separate concerns early: A notification engine that grows into a monolith becomes hard to scale. Each channel should be independently deployable.
Plan for data retention: Notification data grows quickly. A retention and archiving strategy is needed from day one.
User preferences are complex: A simple on/off toggle becomes timezone-aware, frequency-based, channel-specific preferences with quiet hours and emergency overrides.

The next part of this series covers real-time delivery mechanisms: WebSocket connections, push notifications, and the channel-specific implementations. It also covers why retry logic and circuit breakers are required, not optional, in a production notification service.

References

What is Amazon SNS? - AWS Documentation - Overview of Amazon Simple Notification Service as the pub/sub backbone for multi-channel notification fanout architecture
Common Amazon SNS scenarios - AWS Documentation - Fanout, mobile push, and application-to-application messaging patterns that underpin scalable notification architectures
Using dead-letter queues in Amazon SQS - AWS Documentation - SQS dead-letter queue configuration for isolating and replaying failed notification deliveries
Firebase Cloud Messaging - firebase.google.com - Official FCM documentation covering multi-platform push notification delivery for Android, iOS, and web
Sending notification requests to APNs - Apple Developer - Apple’s official guide for transmitting push notification payloads to iOS and macOS devices via APNs

The foundation outlined here may seem over-engineered for a simple notification system. When debugging why 50,000 users did not receive their password reset emails during a product launch, every piece of observability and resilience built into this design proves its value.

Building a Scalable User Notification System

A comprehensive 4-part series covering the design, implementation, and production challenges of building enterprise-grade notification systems. From architecture and database design to real-time delivery, debugging at scale, and performance optimization.

Progress 1 of 4 posts

First post in this series

Next Real-time Notifications and Multi-Channel Delivery

All posts in this series

Part 1: Architecture and Database Design

Part 2: Real-time Notifications and Multi-Channel Delivery

Part 3: Production War Stories: Debugging at Scale

Part 4: Analytics and Performance Optimization

View series →

Database Selection Guide: From Classical to Edge - A Complete Engineering Perspective

Comprehensive guide to choosing the right database for your project - covering SQL, NoSQL, NewSQL, and edge solutions with real-world implementation stories and performance benchmarks.

databasepostgresqlmysql+8

September 4, 2025

Aurora Serverless v2 in Review: An Architectural Deep Dive

What Aurora Serverless v2 actually is under the hood: the shared Aurora storage layer, the ACU-driven compute layer, the Caspian heat-management substrate, scale-to-zero mechanics, and mixed-mode clusters.

awsauroraaurora-serverless-v2+4

May 25, 2026

AWS Cognito + Verified Permissions for SaaS Authorization

A deep dive into building SaaS authorization with AWS Cognito and Verified Permissions. Covers Cedar policy language, multi-tenant patterns, JWT token flow, cost analysis, and common mistakes with TypeScript examples.

authorizationawscognito+4

March 22, 2026

External Authorization Management Systems: Choosing the Right Platform for Your Architecture

A vendor-neutral evaluation of external authorization platforms including AWS Verified Permissions, SpiceDB, OpenFGA, Cerbos, and OPA. Covers architecture patterns, cost analysis, and a decision framework for engineering teams.

authorizationsecurityarchitecture+5

March 22, 2026

Cedar vs Rego vs OpenFGA: Policy Language Comparison

A deep technical comparison of Cedar, Rego, OpenFGA DSL, and Cerbos YAML/CEL policy languages. Covers syntax, performance benchmarks, formal verification, tooling, and integration patterns with TypeScript examples for each language.

authorizationsecurityarchitecture+3

March 22, 2026

The Hidden Complexity of “Simple” Notifications

System Architecture

Event-Driven Architecture

The Notification Engine: Heart of the System

Database Design: The Foundation That Makes or Breaks You

Core Tables

Event Storage and Tracking

Performance Lessons from Production

User Preferences: More Complex Than You Think

Template System: Localization and Personalization

Rate Limiting: Protecting Users and Providers

Key Lessons for Production Systems

References

Building a Scalable User Notification System

All posts in this series

Related posts