Bufstream

Bufstream is the Kafka-compatible message queue built for the data lakehouse era. Because it builds on off-the-shelf technologies such as S3 and Postgres instead of expensive machines with large attached disks, Bufstream is 8x less expensive to operate than Apache Kafka. But that's not why we built it: Bufstream brings schema-driven development and schema governance to streaming data, solving the data quality problems that plague Kafka.

Paired with the Buf Schema Registry, Bufstream's broker-side schema awareness solves longstanding problems with data quality and schema governance while enabling new capabilities like semantic validation and direct-to-Iceberg topic storage.

First-class schema support

If you send Protobuf payloads over Kafka, Bufstream goes far beyond being just a great Kafka replacement.

Bufstream's first-class, broker-side schema support means it deeply understands your Protobuf payloads. Paired with the Buf Schema Registry, Bufstream provides centralized schema management, version control, and breaking change detection. Bufstream is also a single binary that's easy to run locally and works with local Buf CLI workspaces, making it simple to get started with schema-driven streaming development. As a drop-in Kafka replacement, it also works with your existing Confluent Schema Registries.

Broker-side schema awareness

In traditional Kafka, brokers are simple data pipes; they have no understanding of the data that traverses them. This simplicity helped Kafka become ubiquitous, but in practice, most data sent through Kafka topics has a schema that describes it, and understanding that schema is critical to ensuring data quality. Unfortunately, in the Kafka ecosystem, this job is precariously left to clients, bolted on as an afterthought to an ecosystem that was never designed to understand schemas in the first place.

We think this is a broken model:

  • Your producer's client should be extremely simple: post a raw message to your Kafka topic and let the broker deal with the rest. By forcing clients to understand your schemas, you require the ecosystem to maintain fat, complex Kafka clients across many languages, and language support suffers as a result.
  • Systems should never rely on client-side enforcement! Client-side enforcement is effectively "opt-in": producers can choose to comply or not, so you have no guarantees about the quality of the data sent to your consumers.

Bufstream flips the script: its brokers are schema-aware. Bufstream connects directly to your schema registry to understand the shape of the data across your topics. Broker-side schema awareness is the foundation that makes everything else possible: semantic validation, build-time schema governance, and data quality guarantees that actually work.
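
To see how thin this keeps producers, here's a minimal sketch using the stock confluent-kafka Python client. The broker address and the user-events topic are placeholders, not Bufstream-specific values:

```python
from confluent_kafka import Producer

# Any stock Kafka client works against Bufstream, because it speaks the
# Kafka protocol. The broker address and topic name are placeholders.
producer = Producer({"bootstrap.servers": "localhost:9092"})

payload = b"..."  # a serialized Protobuf message

# The client stays thin: it just posts raw bytes. The schema-aware broker,
# not the client, decides whether the payload conforms to the topic's schema.
producer.produce("user-events", value=payload)
producer.flush()
```

Nothing here is Bufstream-specific: any off-the-shelf client works, and the broker takes on the job of rejecting nonconforming payloads.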

Semantic validation

Merely enforcing that your payloads match the expected shape is rarely enough, but that's all other Kafka brokers can do. Fields have properties: you may want to ensure that an int field is always between 0 and 100, or that a string field always matches a regex or represents a valid email address. Bufstream pairs with Protovalidate to semantically validate your fields on the fly. Semantic validation extends schema awareness into data quality guarantees never before seen in the Kafka ecosystem: your consumers can be confident that the data they process conforms to the properties they expect. When a producer sends a semantically invalid message, Bufstream either rejects the entire batch or sends the offending record to a dead-letter queue.
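
As an illustration, here's a minimal sketch of what such constraints look like and how they behave, assuming a hypothetical UserEvent schema and its generated Python module. Bufstream applies Protovalidate rules like these broker-side; the same rules can also be checked anywhere the schema is available:

```python
import protovalidate

# Hypothetical generated module for a schema annotated with Protovalidate
# constraints, for example:
#
#   import "buf/validate/validate.proto";
#
#   message UserEvent {
#     string email = 1 [(buf.validate.field).string.email = true];
#     int32  score = 2 [(buf.validate.field).int32 = {gte: 0, lte: 100}];
#   }
from gen.event.v1 import event_pb2

# This message is well-formed Protobuf, but semantically invalid.
msg = event_pb2.UserEvent(email="not-an-email", score=250)

try:
    protovalidate.validate(msg)
except protovalidate.ValidationError as err:
    print(f"rejected: {err}")  # e.g. score must be in [0, 100]
```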

Iceberg integration

Bufstream's Iceberg integration writes topic data directly as Apache Iceberg™ tables: because Bufstream understands the shape of your data and stores its own data in object storage, it can persist records as Iceberg tables as it receives them, making them immediately ready for consumption by your compute engine of choice.

This is transformative. Instead of running a separate, expensive ETL pipeline that consumes your records and produces duplicate Iceberg tables with additional storage, Bufstream handles the process on the fly. By eliminating the extra data copy, Bufstream effectively eliminates the cost of Kafka storage, and it removes the ETL pipeline's drag on your time-to-insight entirely.
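
For example, a topic's table can be read with any Iceberg-capable engine. Here's a minimal sketch using pyiceberg; the catalog settings and the streams.user_events table name are illustrative assumptions, not Bufstream-specific values:

```python
from pyiceberg.catalog import load_catalog

# Illustrative only: the catalog endpoint, warehouse location, and table
# name below are assumptions for this sketch.
catalog = load_catalog(
    "lakehouse",
    **{
        "uri": "http://localhost:8181",     # REST catalog endpoint (assumed)
        "warehouse": "s3://my-warehouse/",  # object storage bucket (assumed)
    },
)

# Load the Iceberg table maintained for a topic and query it like any
# other lakehouse table.
table = catalog.load_table("streams.user_events")
df = table.scan(row_filter="score >= 90").to_pandas()
print(df.head())
```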

As a pure Kafka replacement

As a pure Kafka replacement, Bufstream is also best-in-class. Compared to Apache Kafka, Bufstream:

  • Is 8x less expensive to operate, including Buf's licensing fees.
  • Scales throughput from zero to hundreds of GB/s in a single cluster, with no fuss and virtually zero maintenance.
  • Is active-active: Bufstream brokers are leaderless, so writes can happen to any broker in any zone, reducing networking fees. For GCP clusters, writes can even happen to any broker across multiple regions without significantly affecting cost characteristics.

Bufstream is the best of a new wave of object-store-based Kafka implementations that have come to market promising lower costs. Unlike WarpStream, StreamNative Ursa, Confluent Freight, and others, Bufstream:

  • Can be completely self-hosted, just like Apache Kafka. Bufstream deployments do not talk to Buf the company in any way; we can even eliminate phone-homes for billing! Competitors are either fully managed, or merely allow your data to be self-hosted while still-sensitive metadata continues to be sent back to their respective companies.
  • Is based completely on off-the-shelf primitives that can be deployed in all major clouds: AWS, GCP, and Azure. Bufstream needs a metadata store (Postgres, Google Cloud Spanner) and an object store (S3, GCS, Azure Blob Storage), and is otherwise off to the races. Deploying Bufstream is as simple as deploying a Helm chart. No bespoke, untested infrastructure to maintain.
  • Has undergone extensive correctness testing, including independent verification by Jepsen. You shouldn't just take our word that Bufstream is production-ready; we've invested the resources to have the best in the business verify how Bufstream performs under pressure.
  • Has the lowest cost and latency characteristics. While benchmarks can be massaged, after extensive testing we are confident that Bufstream leads the pack.

Bufstream is the only enterprise-ready, new-wave Kafka implementation available today that is verified to handle your production workloads.

One schema language for your entire stack

With Bufstream, we're making it possible to use a single schema language across your entire stack. Using Protobuf schemas as the single source of truth, you can:

  • Define your schemas once and use them across your RPC framework, streaming data platform, and data lake tables
  • Enforce schema quality at build-time with the Buf Schema Registry, ensuring bad schema changes are blocked before they ever reach production
  • Validate data at the broker with Protovalidate, preventing semantically invalid data from ever entering your topics or data lake
  • Query your streaming data with direct-to-Iceberg topic storage, paying only once for both Kafka and data lake storage

Without proper schema governance, there can be no confidence in the data traversing your systems. Bufstream and the BSR solve this once and for all, taking you from perpetual cleanup to consistently trusted data.

If this is a world that interests you, we'd suggest: