Building a Reliable Event Bus on Serverless Infrastructure
Real-time event systems and serverless infrastructure are in fundamental tension.
A real-time event system wants long-lived connections. Redis pub/sub, WebSockets, SSE — all of them work by keeping a connection open and pushing data as it arrives. A serverless function lives for the duration of a single request, then dies. Vercel's default function timeout is 10 seconds; the Pro plan max is 60 seconds. No connection survives longer than that.
This isn't a showstopper — it's a constraint. WFW's event bus is designed around these constraints rather than fighting them. This post covers the specific mechanisms we use and the failure modes each one addresses.
The Core Problem: Pub/Sub Needs State; Serverless Functions Don't Have It
The standard Redis pub/sub pattern looks like this: a publisher emits an event to a channel, subscribers listening on that channel receive it. The subscriber must be connected and listening at the moment of publication. If it's not, the event is lost.
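The lost-event behavior is easy to demonstrate with a toy in-memory pub/sub. This is illustrative only, not WFW code:

```typescript
// Minimal in-memory pub/sub, showing the core limitation:
// events published with no active subscriber are simply dropped.
type Handler = (event: string) => void;

class TinyPubSub {
  private channels = new Map<string, Set<Handler>>();

  subscribe(channel: string, handler: Handler): () => void {
    const set = this.channels.get(channel) ?? new Set<Handler>();
    set.add(handler);
    this.channels.set(channel, set);
    return () => set.delete(handler); // unsubscribe
  }

  publish(channel: string, event: string): number {
    const set = this.channels.get(channel);
    set?.forEach((h) => h(event));
    return set?.size ?? 0; // number of subscribers that saw it
  }
}

const bus = new TinyPubSub();
const delivered = bus.publish('events:biz_1', 'call.started'); // no one listening
// delivered === 0: the event is gone; a later subscriber never receives it
```

Redis pub/sub behaves the same way at publication time, which is exactly the property that breaks on serverless.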
On serverless infrastructure, "connected and listening" is impossible. The function that would have been listening is not running. There is no persistent process.
The naive workaround — poll a database for new events — works but introduces latency and load proportional to polling frequency. Poll every second and you get acceptable latency with high DB load. Poll every 10 seconds and you get tolerable DB load with unacceptable latency for time-sensitive events like call.started or agent.activated.
WFW's approach is a hybrid: SSE with a flush pattern for immediate delivery, a Redis replay buffer for missed events, and Inngest for webhook delivery where long-running retries are needed.
The SSE Polling-with-Flush Pattern
Browser and bot consumers open SSE (Server-Sent Events) connections to a streaming endpoint at /v2/events. The serverless function accepts the connection and flushes events as they arrive. The function has a maximum lifetime of 25 seconds, deliberately set below the route's configured 30-second maxDuration to leave cleanup headroom.
```
Client                                      Vercel Serverless Function
  │                                              │
  │── GET /v2/events?since=0 ──────────────────► │
  │                                              │  Connects to Redis
  │                                              │  Subscribes to channel
  │ ◄── event: agent.activated ──────────────────│  (flush as received)
  │ ◄── event: call.started ─────────────────────│
  │                                              │
  │                 [25s timeout]                │
  │ ◄── event: stream_end ───────────────────────│  Signals reconnect
  │                                              │  [function dies]
  │
  │── GET /v2/events?since=last_event_id ──────► │  (new function instance)
  │                                              │  Replays from buffer if gap
```
When the 25-second window ends, the function sends a stream_end event carrying the ID of the last delivered event, then closes the connection. The client immediately reconnects with ?since=<event_id>. The new function instance checks the Redis replay buffer for any events that fired between the last delivery and the new connection — and replays them before subscribing to the live channel.
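The window-end behavior can be modeled with a small async generator. This is a simplified sketch, not the production handler; the real function writes SSE frames to the HTTP response:

```typescript
// Simplified model of the 25-second flush window: forward events from the
// live subscription until the time budget is spent, then emit stream_end
// so the client knows to reconnect with ?since=.
type StreamItem<T> = T | { type: 'stream_end' };

async function* streamWindow<T>(
  live: AsyncIterable<T>,
  budgetMs: number
): AsyncGenerator<StreamItem<T>> {
  const deadline = Date.now() + budgetMs;
  for await (const event of live) {
    if (Date.now() >= deadline) break; // budget exhausted mid-stream
    yield event;
  }
  yield { type: 'stream_end' }; // always close the window explicitly
}
```

In production, each yielded item would be serialized as an SSE `data:` frame and flushed as it arrives.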
The reconnect latency is typically 200–400ms. In that window, events can be missed. The replay buffer is what fills the gap.
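Because replayed events and the first live events can overlap at the boundary, a client typically deduplicates by event ID when stitching the two streams together. A sketch of that merge (a hypothetical client-side helper, not part of any WFW SDK):

```typescript
// Merge replayed events with the first live events, dropping duplicates.
// Events are assumed to carry a unique `id`, as WFW events do.
interface BusEvent {
  id: string;
  type: string;
  timestamp_ms: number;
}

function mergeReplayAndLive(replayed: BusEvent[], live: BusEvent[]): BusEvent[] {
  const seen = new Set(replayed.map((e) => e.id));
  const merged = [...replayed];
  for (const e of live) {
    if (!seen.has(e.id)) {
      seen.add(e.id);
      merged.push(e);
    }
  }
  return merged;
}
```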
The Redis Replay Buffer
Every event published to the WFW event bus is written to two places simultaneously: the Redis pub/sub channel (for live subscribers) and a Redis sorted set (for the replay buffer). This is the dual-write guarantee.
```typescript
async function emitEvent(event: WFWEvent): Promise<void> {
  const serialized = JSON.stringify(event);

  // Dual-write: pub/sub channel + replay buffer.
  // Promise.allSettled ensures both writes are attempted even if one fails.
  const [pubResult, bufferResult] = await Promise.allSettled([
    redis.publish(`events:${event.businessId}`, serialized),
    redis.zadd(
      `replay:${event.businessId}`,
      { score: event.timestamp_ms, member: serialized }
    )
  ]);

  // Log partial failures — event was emitted but may not be replayable
  if (pubResult.status === 'rejected') {
    logger.error('pub/sub emit failed', { eventId: event.id, error: pubResult.reason });
  }
  if (bufferResult.status === 'rejected') {
    logger.error('replay buffer write failed', { eventId: event.id, error: bufferResult.reason });
  }
}
```
The sorted set uses the event timestamp as the score. Replay queries use ZRANGEBYSCORE to fetch events after a given timestamp:
```typescript
// Fetch events after last_event_id's timestamp
const missed = await redis.zrangebyscore(
  `replay:${businessId}`,
  lastEventTimestamp + 1,
  '+inf',
  { limit: { offset: 0, count: 100 } } // max 100 events per replay
);
```
The replay buffer retains the most recent 100 events per business and expires after 24 hours. If a client has been disconnected for more than 24 hours, it can't replay — it needs to fetch current state via the REST API and treat itself as a fresh subscriber.
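The retention rule can be modeled as a pure function. In Redis terms this would be a ZREMRANGEBYRANK trim plus an EXPIRE on the buffer key; the sketch below is not WFW code and only shows the newest-100 semantics:

```typescript
// Pure model of the replay buffer's retention: keep only the newest `max`
// events, ordered by timestamp (the score order in the sorted set).
interface Buffered {
  id: string;
  timestamp_ms: number;
}

function applyRetention(events: Buffered[], max = 100): Buffered[] {
  return [...events]
    .sort((a, b) => a.timestamp_ms - b.timestamp_ms)
    .slice(-max); // drop the oldest entries beyond the cap
}
```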
Why Inngest for Webhook Delivery
Webhooks have different reliability requirements than SSE streams. An SSE subscriber that misses events can replay from the buffer. A webhook that fails needs to be retried — potentially for hours, if the subscriber's endpoint is down.
Redis pub/sub is unsuitable for this. If the subscriber's webhook endpoint returns a 500, there's no built-in retry mechanism. You'd need to build one: store pending deliveries, track retry state, implement exponential backoff, handle dead letters. That's a message queue, and building one on top of Redis sorted sets is a non-trivial project.
WFW uses Inngest for webhook delivery. When an event is emitted, the emitter sends it to both Redis (for SSE subscribers) and Inngest (for webhook subscribers):
```typescript
// Webhook subscribers get Inngest delivery
await inngest.send({
  name: 'wfw/event.emit',
  data: {
    event,
    webhookSubscriptions: await getWebhookSubscriptions(event.businessId, event.type)
  }
});
```
Inngest handles the delivery, retry logic (exponential backoff, up to 24 hours), dead letter queue, and delivery logs. The WFW codebase doesn't implement any of that — it's delegated entirely to Inngest.
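For intuition, an exponential backoff schedule of the kind Inngest applies looks roughly like this. The base delay, cap, and attempt count here are illustrative assumptions, not Inngest's actual configuration:

```typescript
// Illustrative exponential backoff: the delay doubles each attempt and is
// capped, so a dozen attempts comfortably span a 24-hour retry horizon.
function backoffScheduleMs(
  attempts: number,
  baseMs = 30_000,        // 30s first retry (assumed)
  capMs = 4 * 3_600_000   // 4h cap between attempts (assumed)
): number[] {
  return Array.from({ length: attempts }, (_, i) =>
    Math.min(baseMs * 2 ** i, capMs)
  );
}

// backoffScheduleMs(4) → [30000, 60000, 120000, 240000]
```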
The tradeoff: Inngest adds ~300ms of delivery latency vs. direct HTTP. For webhook delivery, that's acceptable. For SSE streams, it wouldn't be — which is why the two mechanisms coexist.
Failure Modes and Recovery
Missed events (reconnect gap). The SSE reconnect with ?since= plus the replay buffer handles this in normal operation. Worst case: a ~400ms reconnect multiplied by the event rate means low single-digit missed events, all replayed immediately.
Replay buffer overflow. If a client reconnects after receiving more than 100 events during its disconnection, the buffer returns only the most recent 100. The client receives them and may have gaps. The correct recovery: call GET /v2/agents/{id} or the relevant resource endpoint to get current state, then resume streaming from the current timestamp.
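A client can detect the overflow case cheaply: a replay that returns a full page of 100 events signals possible truncation. A sketch of that check (a hypothetical helper, not WFW's client code):

```typescript
// A full page means the replay hit its cap; older events during the
// disconnection may already have been trimmed away, so the client should
// resync state via REST rather than trust the replay to be complete.
function recoveryAction(replayedCount: number, cap = 100): 'resume' | 'resync' {
  return replayedCount >= cap ? 'resync' : 'resume';
}
```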
Redis pub/sub failure. If Redis is unavailable when an event is emitted, Promise.allSettled ensures both writes are still attempted. If both fail, the event is logged but not delivered. WFW emits a system.event_delivery_failed event (via a secondary Redis connection) for observability. Clients polling the REST API will see the state change; real-time delivery is degraded but data isn't lost.
Inngest delivery failure. Inngest retries for 24 hours. If the subscriber's endpoint doesn't recover within 24 hours, the event lands in the dead letter queue and is visible in the Inngest dashboard. WFW surfaces dead-letter counts in the wfw-admin monitoring panel.
The gap between Redis write and Inngest write. If the process crashes between the two writes, one succeeds and one doesn't. The dual-write uses Promise.allSettled specifically to ensure both are attempted, but an application crash mid-function can still cause partial delivery. This is a known gap; the current tolerance is that SSE and webhook consumers will have slightly different event completeness in crash scenarios. A future improvement (Sprint 24) would write to a durable queue first and fan out from there.
The Serverless Event Bus in Summary
```
EVENT EMITTED
      │
      ├──► Redis pub/sub channel ──► SSE subscribers (live connection)
      │            │
      │            └──► Redis sorted set (replay buffer, 24h / 100 events)
      │
      └──► Inngest ──► Webhook subscribers (with retry, dead letter)
```
This architecture is not as elegant as a dedicated message broker. It's the architecture that runs reliably on Vercel serverless without a persistent server process, without infrastructure outside of what Vercel + Redis + Inngest already provide, and without building a custom retry system. The constraints are real. The solutions are specific to those constraints.
This concludes the Developer Deep Dives series. Series 4 continues with Adding a Voice Agent to Your SaaS in 3 API Calls — the practical integration guide for SaaS builders.
Ready to put AI voice agents to work in your business?
Get a Live Demo — It's Free