Bufstream — Buf’s drop-in replacement for Apache Kafka® — now supports business-critical workloads with multi-region, active-active clusters on GCP. Unlike other solutions, multi-region Bufstream clusters scale without limit, easily tolerate full region outages, add no operational complexity, and have clear SLAs. And at just $2.3M/month for 100 GiB/s of writes and 300 GiB/s of reads, multi-region Bufstream is 3x cheaper and infinitely more operable than a self-hosted Apache Kafka stretch cluster.
As legacy software moves to the cloud, most systems simply treat cloud availability zones as on-premises racks or data centers. For example, the typical Apache Kafka deployment runs in a single cloud region, treating each availability zone as a rack. Using this approach, legacy systems can tolerate small cloud outages — for example, the typical Kafka deployment runs in three zones and can tolerate a single-zone outage without much fuss.
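Concretely, the zones-as-racks mapping is just Kafka’s standard rack-awareness configuration. The settings below are the real Kafka property names, shown as plain key-value pairs with an example zone filled in:

```python
# Broker-side settings for a single-region, three-zone deployment: each broker
# advertises its availability zone as its "rack", so rack-aware replica
# placement spreads the three replicas across the three zones.
broker_config = {
    "broker.rack": "us-west1-a",  # set per broker to that broker's zone
    # Enables fetch-from-follower (KIP-392) so consumers can read from an
    # in-zone replica instead of crossing zones to the partition leader.
    "replica.selector.class":
        "org.apache.kafka.common.replica.RackAwareReplicaSelector",
}

# Consumer-side setting that tells the cluster which "rack" (zone) the client
# is in, so it can be served by a nearby follower.
consumer_config = {
    "client.rack": "us-west1-a",
}
```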
Unfortunately, these small outages are just the tip of the iceberg. While less common than single-zone outages, whole cloud regions go down with some regularity: AWS, GCP, and Azure each had a full-region outage in early 2023. To mitigate the effects of large-scale outages, highly resilient systems must span multiple regions. For legacy software, this is a significant challenge.
Today, Kafka is commonly used for business-critical streaming workloads like transaction processing, fraud detection, and dynamic pricing. Naturally, businesses want these functions to continue uninterrupted during large cloud outages, so we need a multi-region Kafka architecture. The ideal multi-region Kafka deployment would survive a full regional outage without manual failover or data loss, require no changes to application code, add no operational burden, and remain cost-effective.
In short, we’d like multi-region deployments to behave like single-region clusters — just better.
The Apache Kafka community has been wrestling with multi-region Kafka for more than a decade, starting with the initial version of MirrorMaker and continuing through the current discussion of KIP-986. Today, the most widely-used solutions are MirrorMaker 2, Confluent Cluster Linking, and stretch clusters. Judged by the criteria above, all three are terrible.
MirrorMaker 2 builds on Kafka Connect to asynchronously replicate topics and consumer group offsets from a source cluster to a destination cluster. Confluent Cluster Linking is conceptually similar, but doesn’t require Kafka Connect. Under both systems, replicated topics and consumer groups are read-only in the destination cluster. These asynchronous replicators deliver none of the capabilities we want: failover to the destination cluster is manual, any data not yet replicated when a region fails is effectively orphaned, and application code must be written to cope with read-only replicated topics and groups.
Stretch clusters take a completely different approach. Rather than asynchronously replicating data between independent clusters, they simply spread a single Kafka cluster across regions. This comes with notable operational drawbacks, but it does deliver some of our desired capabilities: because there is only one cluster, failover is automatic and applications keep producing and consuming without code changes.
Neither of these options is attractive. Asynchronous replicators require manual failovers, can orphan an unbounded amount of data during outages, and complicate application code. Stretch clusters offer true active-active multi-region capabilities, but are operationally challenging. Both are unappealingly expensive.
By adopting a leaderless, diskless architecture, Bufstream can do much better. Even in single-region deployments, Bufstream brokers are stateless and communicate only with brokers in the same availability zone. All inter-zone communication goes through object storage and the cluster’s metadata backend. To expand a Bufstream cluster from a single region to multiple regions, we need a consistent, multi-region metadata backend and object storage bucket.
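For intuition, here is a deliberately tiny Python sketch of a leaderless, diskless produce path. It illustrates the idea rather than Bufstream’s actual code: the broker writes a batch to object storage, then commits it to a transactional metadata store, and nothing in the path cares which zone or region the broker runs in.

```python
import uuid
from dataclasses import dataclass

# In-memory stand-ins for the two shared dependencies: an object storage
# bucket (the data plane) and a transactional metadata store (the control
# plane). In a real deployment these are the only cross-zone touchpoints.
object_store: dict[str, bytes] = {}
metadata_log: list[dict] = []


@dataclass
class Batch:
    topic: str
    partition: int
    records: list[bytes]


def produce(batch: Batch) -> None:
    """Diskless produce: no local log segments, no partition leader."""
    # 1. Write the batch to object storage under a unique key.
    key = f"{batch.topic}/{batch.partition}/{uuid.uuid4()}"
    object_store[key] = b"\n".join(batch.records)
    # 2. Commit the key to the metadata store; the order of commits here is
    #    what defines the order of records in the partition.
    metadata_log.append({"key": key, "records": len(batch.records)})


produce(Batch(topic="orders", partition=0, records=[b"r1", b"r2"]))
print(metadata_log)
```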
Google Cloud Platform offers both. Bufstream already supports Spanner as a metadata backend, so switching to a multi-region Spanner cluster doesn’t require any special code. As a metadata system, multi-region Spanner is unmatched: it’s fully consistent, stores data in multiple regions before acknowledging writes, and has a 99.999% availability SLA. And because Bufstream puts very little load on its metadata backend, Spanner typically accounts for less than 1% of cluster costs.
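For illustration, committing batch metadata to a multi-region Spanner instance looks like any other Spanner write from the google-cloud-spanner client. The project, instance, database, and table names below are hypothetical, and the multi-region replication itself is a property of the instance configuration rather than of this code:

```python
from google.cloud import spanner  # pip install google-cloud-spanner

# Hypothetical identifiers; the instance would be created with a multi-region
# configuration (read-write replicas in two regions plus a witness).
client = spanner.Client(project="example-project")
database = client.instance("bufstream-metadata").database("metadata")


def record_batch(transaction):
    # Spanner acknowledges the commit only after the write is durably stored
    # in multiple regions, which is exactly the guarantee the metadata needs.
    transaction.execute_update(
        "INSERT INTO batches (topic, partition, object_key) "
        "VALUES ('orders', 0, 'orders/0/abc123')"
    )


database.run_in_transaction(record_batch)
```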
Google Cloud also offers dual- and multi-region Cloud Storage buckets. These buckets offer interesting guarantees: object metadata is fully consistent and synchronously stored in multiple regions, but the object data is replicated asynchronously. If clients try to fetch not-yet-replicated objects, the data is automatically fetched from the source region, stored locally, and returned to the client. This is perfect for Bufstream: the strongly consistent metadata keeps the log correct across regions, and in the rare case that a consumer fetches an object before its data has replicated, Cloud Storage transparently serves it from the region where it was written.
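For completeness, a multi-region bucket is used exactly like a single-region one; its location is a one-time choice at creation. The bucket and object names below are made up:

```python
from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()

# Create a bucket in the "US" multi-region; every later read and write uses
# the same API as a single-region bucket.
bucket = client.create_bucket("example-bufstream-data", location="US")

# Uploads return once the object's metadata is durable in multiple regions;
# the object data itself replicates asynchronously in the background.
blob = bucket.blob("orders/0/abc123")
blob.upload_from_string(b"record-batch-bytes")

# A reader in another region gets the bytes even if replication hasn't caught
# up yet: Cloud Storage fetches them from the source region transparently.
print(len(bucket.blob("orders/0/abc123").download_as_bytes()))
```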
Just by switching Bufstream to multi-region Cloud Storage and Spanner, we satisfy most of our multi-region requirements: the cluster stays writable and readable in both regions, tolerates a full region outage without manual intervention, and requires no changes to application code or day-to-day operations.
To prove these cost and performance claims, we ran our largest workload on a Bufstream cluster running in two regions. Built with the Open Messaging Benchmark Framework, this workload creates 100 GiB/s of uncompressed writes and 300 GiB/s of uncompressed reads. With 4:1 client-side compression, this shrinks to 25 GiB/s of writes and 75 GiB/s of reads on the wire. We split the workload evenly, running half in each region.
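The on-wire rates follow directly from the compression ratio and the even split across regions:

```python
writes_gibps, reads_gibps = 100, 300   # uncompressed client-side throughput
compression = 4                        # 4:1 client-side compression

print(writes_gibps / compression, reads_gibps / compression)  # 25.0 75.0 GiB/s on the wire
print(writes_gibps / 2, reads_gibps / 2)                       # 50.0 150.0 GiB/s of uncompressed load per region
```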
The Bufstream cluster required to run this workload is similar to our single-region setup. We used a cluster of 108 n2d-standard-32 brokers, each of which has 32 vCPUs and 128 GiB of memory. We distributed the cluster across six availability zones, three in the us-west1 region and three in us-west2. We used a multi-region GCS bucket in the us multi-region as the cluster’s primary storage backend. For metadata storage, we configured Bufstream to use a 9-node multi-region Spanner cluster with read-write nodes in us-west1 and us-west2 and witness nodes in us-west3.
This cluster serves our enormous workload with aplomb. On average, the Bufstream brokers use just a third of the available vCPUs and half the available memory — the same performance profile as a single-region cluster serving an identical workload.
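Rolled up, those deployment and utilization figures look like this (simple arithmetic, no assumptions beyond the numbers above):

```python
brokers, zones = 108, 6
vcpus_per_broker, mem_gib_per_broker = 32, 128

print(brokers // zones)                  # 18 brokers per availability zone
print(brokers * vcpus_per_broker)        # 3456 vCPUs provisioned
print(brokers * mem_gib_per_broker)      # 13824 GiB of memory provisioned

# Steady-state usage: roughly a third of the vCPUs and half of the memory.
print(brokers * vcpus_per_broker // 3)   # ~1152 vCPUs busy
print(brokers * mem_gib_per_broker // 2) # 6912 GiB of memory in use
```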
Backed by GCS and Spanner, Bufstream handles this load with a median end-to-end latency of 450 milliseconds and a p99 of 850 milliseconds.
A single-region Apache Kafka cluster handling this load would be challenging. A stretch cluster is unthinkable. But with a few configuration changes, we’re able to convert a single-region Bufstream cluster to an active-active, multi-region cluster that easily handles this scale. And best of all, going multi-region doesn’t add any operational load. Google Cloud’s SRE teams are carrying the pager for Spanner and Cloud Storage, leaving us responsible just for the stateless, autoscaling Bufstream brokers. Our clients don’t need to think about complex multi-region replication topologies either — they can write to any topic and join any consumer group, in one region or both, and trust that their application code will continue to function correctly.
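Because Bufstream speaks the Kafka protocol, “write anywhere, read anywhere” needs nothing beyond a stock Kafka client. Here is a minimal sketch using confluent-kafka; the bootstrap addresses, topic, and group name are hypothetical:

```python
from confluent_kafka import Consumer, Producer  # pip install confluent-kafka

# Hypothetical per-region bootstrap addresses: clients connect to the brokers
# in their own region but see the same topics and consumer groups.
BOOTSTRAP = {
    "us-west1": "bufstream.us-west1.example.internal:9092",
    "us-west2": "bufstream.us-west2.example.internal:9092",
}

# A producer in us-west1 and a consumer in us-west2 share a topic and a group
# with no replication topology or failover logic in application code.
producer = Producer({"bootstrap.servers": BOOTSTRAP["us-west1"]})
producer.produce("payments", key=b"txn-1", value=b'{"amount": 42}')
producer.flush()

consumer = Consumer({
    "bootstrap.servers": BOOTSTRAP["us-west2"],
    "group.id": "fraud-detection",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["payments"])
msg = consumer.poll(10.0)
if msg is not None and msg.error() is None:
    print(msg.key(), msg.value())
consumer.close()
```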
The Bufstream cluster handling this workload costs $2,358,673 per month: $1,840,273 of infrastructure costs and $518,400 of fees. Assuming a 1 year commitment, deployment in three zones of us-west1 and three zones of us-west2, and 7 days of retention, the 108 n2d-standard-32 Bufstream brokers, half in us-west1 and half in us-west2, cost $74,015 per month. Tier 1 networking, required for guaranteed network throughput, adds an additional $38,972 per month.

A comparable Apache Kafka stretch cluster is impossible to operate and still costs $7,527,268 per month, 3x more than Bufstream — even with fetch-from-follower and tiered storage enabled. Again assuming a 1 year commitment and 7 days of retention:
The deployment spans three regions (us-west1, us-west2, and us-west3) with a replication factor of 3. We can’t effectively deploy a stretch cluster in only two regions because Kafka’s only abstraction for fault domains is a rack. To ensure that our cluster splits partition replicas between regions, we must configure the cluster to ignore availability zones and treat regions as our “racks.” But once we do that, Kafka will happily place multiple replicas in the same zone. As a result, a stretch cluster in two regions will often place all replicas for a partition in just two zones (one in each region). After all the expense and trouble of running a stretch cluster, that’s worse zone diversity than a single-region cluster!

The cluster needs 171 n2-standard-48 brokers, split evenly between regions and each with 37 TiB of attached Hyperdisk, at a cost of $942,148 per month. With tiered storage, we assume that a cost-conscious deployment would keep 12 hours of data on disk with an additional 12-hour buffer. After 4x compression and 3x replication (one replica in each region), our 100 GiB/s workload would require 6.2 PiB of storage. Spread over those 171 brokers, each needs about 37 TiB of storage. Note that the n4 instance family we used in our single-region benchmarks isn’t yet available in these regions, so we’re forced to use the slightly more expensive n2 family.

At this scale, an Apache Kafka stretch cluster is more of a theoretical exercise than a practical option — even the most talented infrastructure team would struggle to operate this cluster, scale up during outages, and scale back down afterwards.
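The storage sizing behind those broker counts, and the headline price ratio, reduce to a few lines of arithmetic:

```python
writes_gibps = 100                    # uncompressed write throughput
on_disk_gibps = writes_gibps / 4      # 4:1 compression
replicated_gibps = on_disk_gibps * 3  # replication factor 3
retention_s = 24 * 3600               # 12 hours on disk plus a 12-hour buffer

total_gib = replicated_gibps * retention_s
print(total_gib / 1024**2)            # ~6.18 PiB of Hyperdisk across the cluster
print(total_gib / 1024 / 171)         # ~37 TiB per broker across 171 brokers

print(7_527_268 / 2_358_673)          # ~3.2x: Kafka stretch cluster vs. multi-region Bufstream
```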
Among all the streaming data solutions in the market, only Bufstream makes truly robust, active-active deployments practical. Multi-region Bufstream clusters scale without limit, handle single-region outages automatically, add no operational complexity, and have clear data replication SLAs. And despite all that, Bufstream clusters are still less expensive than self-managed, single-region Apache Kafka clusters. There’s no other product in the market that even comes close.
If your business depends on streaming data, we’d love to make your workloads bulletproof. You can get a feel for Bufstream with our interactive demo, dig into our smaller-scale benchmarks and cost analysis, or chat with us in the Buf Slack. For production deployments or to schedule a demo with our team, reach out to us directly!