We've been hard at work on Bufstream, our drop-in replacement for Apache Kafka® rebuilt on top of S3-compatible object storage. It's one of the new breed of object-storage-based Kafka replacements, which seem to have become a dime a dozen. WarpStream kicked off the race in 2023, quickly followed by Bufstream, StreamNative Ursa, and Confluent Freight. In the coming months, even Redpanda is finally getting into the game, and there's a proposal to add support for object storage to Apache Kafka itself in the coming years.
Why the surge? The pitch is simple: S3 replicates across availability zones for free. By using S3 as the backing store for your topics, you eliminate the associated inter-zone networking costs, massively reducing your Kafka spend. The trade-off is latency: S3 is slower than local disks, and most object-storage-based Kafka replacements have a p99 end-to-end latency in the 500-1000ms range. If you can tolerate this (as almost all Kafka users can), you save the money. Along the way, you benefit from a leaderless broker design: write any partition to any broker.
If you're interested in these cost savings, we're convinced that Bufstream is (by far) your best option, and we're happy to chat more.
If you're looking for a modern drop-in replacement for Apache Kafka to save costs and complexity, Bufstream is probably your best bet. We're happy to go head-to-head against any competitor, and we're confident we can win your business (candidly, in head-to-head POCs with our competitors, we usually do). We're proud of what we've built, but cost savings are generally a race to the bottom, and are not why we got into the Kafka game. We've got a bigger mission here, one that leads us back to where Buf started.
At Buf, we're driving a shift towards universal schema adoption: a world where a single schema describes your data across your APIs, your streams, and your data lake.
Engineers shouldn't have to define their network APIs in OpenAPI or Protobuf, their streaming data types in Avro, and their data lake schemas in SQL. Engineers should be able to represent every property they care about directly on their schema, and have these properties propagated throughout their RPC framework, streaming data platform, and data lake tables.
A unified schema approach can dramatically reshape data engineering:
The largest data engineering pain point, poor data quality, can be solved: teams transition from perpetual cleanup to consistently trusted data, and data engineers can stop being data quality QA personnel and get back to their jobs.
In theory, the specific schema language you choose matters less than choosing one at all; in practice, at Buf, we think it should be Protobuf:

- The natural alternative is SQL, in the form of CREATE TABLE statements. Given SQL's widespread use in big data, this is useful at one end of the spectrum. However, SQL is not a schema language appropriate for all parts of your stack: you'd never use CREATE TABLE statements to describe the shape of your RPCs, and there's no tooling to do so. SQL also just isn't great for structured data: nested types and lists need to be projected into sub-tables, and the mapping to language-specific objects or structs is less than obvious.
- While Protobuf is far from perfect, it is the most battle-tested, widely used schema language in existence today. If you're looking to use a schema language anywhere across your stack, in any language, there's probably a Protobuf library you can use (and we may have written it). Protobuf also has a well-defined JSON mapping, which remains critical for human introspection and migration use cases (see the sketch below).
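To make that concrete, here's a minimal sketch (the message and field names are hypothetical, not from a real Buf schema): nested types and lists are first-class in Protobuf, and the canonical JSON mapping is fully specified:

syntax = "proto3";

package acme.orders.v1;

message Order {
  // Nested types are first-class; no sub-table projection required.
  message LineItem {
    string sku = 1;
    uint32 quantity = 2;
  }

  string id = 1;
  // Lists are first-class too; in SQL this would typically become a child table.
  repeated LineItem line_items = 2;
}

// The canonical Protobuf JSON mapping is well defined (field names become
// lowerCamelCase by default):
//   {"id": "order-123", "lineItems": [{"sku": "SKU-1", "quantity": 2}]}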
The world has moved to Protobuf in the last decade, and that transition doesn't look to be slowing down.
Adopting schemas across your stack has historically been a story of fragmentation and frustration. You'd have to use different schema languages at different parts of your stack. REST/JSON dominated the network API space, and fighting against that tide had a huge cost. With the rise of gRPC, Protobuf became the clear alternative by the late 2010s; however, Protobuf development left a lot to be desired. To effectively adopt Protobuf, you'd have to solve compilation, stub generation, distribution, enforcement of common standards, breaking change prevention, documentation, and the list goes on. At best, you'd get CLI tooling seemingly designed in 1970, and perhaps a little bit of documentation. Early adopters had to cobble together patchwork solutions to these problems, which rarely rose to the challenge.
Buf brought together the world's Protobuf experts to solve this once and for all:
buf is your one-stop shop for anything Protobuf. The Buf CLI has become the de facto standard for local Protobuf development across the industry.

All built at Buf, to make Protobuf work for everyone.
So where does Bufstream fit in?
Streaming data has a major data quality problem: there are no guarantees about the quality of the data being produced. This comes down to typical streaming data architecture. In traditional Kafka, brokers are simple data pipes; they have no understanding of the data that traverses them. This simplicity helped Kafka gain ubiquity, but in practice, most data sent through Kafka topics has some schema that represents it.
Unfortunately, in the Kafka ecosystem, schema validation is precariously left to clients, bolted on as an afterthought to an ecosystem that wasn't designed to understand schemas in the first place. Client-side enforcement is, in effect, opt-in enforcement: producers can choose to do it or not, meaning you have no guarantees about the quality of the data sent to your consumers. We'd never accept this state of the world in, say, network APIs. Imagine if your application servers relied on your web clients to validate their data, and your applications persisted whatever they were given: we'd all be in trouble!
Bufstream is more than just a drop-in Kafka replacement. Bufstream is built from the ground up to understand the shape of the data traversing its topics. We call this broker-side schema awareness, and it brings some interesting capabilities. Chief among these is its ability to block bad data from entering topics in the first place.
Bufstream provides governed topics that enable semantic validation via Protovalidate on the producer API. If a record is produced with a message that doesn't pass validation, the entire batch is rejected or the offending record is sent to a DLQ. Importantly, since this happens on the broker, consumers can rely on the knowledge that data within topics always matches its stated constraints.
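As a minimal sketch of what that looks like (the package, message, and field names here are hypothetical, not from a real deployment), the value schema of a governed topic simply carries ordinary Protovalidate constraints, and Bufstream enforces them as records are produced:

syntax = "proto3";

package acme.payments.v1;

import "buf/validate/validate.proto";

message PaymentCaptured {
  // Must be a UUID, or the record never reaches consumers.
  string payment_id = 1 [(buf.validate.field).string.uuid = true];

  // Zero or negative amounts are rejected (or routed to the DLQ) at the broker.
  int64 amount_cents = 2 [(buf.validate.field).int64.gt = 0];
}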
It's a tale as old as time: a required field is zeroed out, or some data is corrupted, and a downstream business intelligence dashboard is subtly wrong for days. The maintainer eventually realizes, and yells at the Kafka team for their data quality issues. The Kafka team, however, had nothing to do with it – they don't control the producers of the data. Everyone scrambles to find the lineage of the bad data until order is restored. Bufstream solves this once and for all: this tale is a thing of the past with broker-side semantic validation.
Bufstream's awareness of your schemas provides so much more, from direct, zero-copy mapping to Iceberg tables (your Iceberg tables are your Kafka storage) to a type-safe transformation engine that's dramatically more performant than any stream data processor in existence. We'll cover these in dedicated posts in the future.
It isn't enough to ensure that bad data for your current schemas doesn't proliferate. You also need to ensure that bad schema changes never make it to production. Deleting fields, changing their types, or adding backwards-incompatible semantic properties can all leave downstream consumers hopelessly broken without any recourse. In almost all cases, breaking schema changes should never hit your network APIs, Kafka topics, or Iceberg tables until you do a proper v2.
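As a hypothetical before-and-after (not the User message from later in this post), each edit below is exactly one of those kinds of break:

// Before: the schema already in production, relied on by producers and consumers.
message Account {
  string id = 1;
  string email = 2;
  uint32 age = 3;
}

// After: three changes, each of which breaks someone downstream.
message Account {
  // Tightened semantics: a new Protovalidate constraint that existing producers
  // and already-stored data may not satisfy.
  string id = 1 [(buf.validate.field).string.uuid = true];

  // Type change: email moves from a length-delimited string to a varint on the
  // wire, so existing consumers no longer read the value they expect.
  int64 email = 2;

  // Field 3 (age) was deleted without being reserved: consumers that depend on
  // it silently read the default value, and the number is free to be reused for
  // something unrelated later.
}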
Consumers need the confidence that producers will never break their schemas until v2 (usually, an entirely new topic or table), but current practices do not incentivize proper schema management and evolution. Schemas are typically shared via a schema registry, such as the Confluent Schema Registry or the AWS Glue Schema Registry. Unfortunately, new schemas are registered with these registries at runtime, by clients that provide whatever schemas are baked into their code. These schemas carry no guarantee of compatibility or of having gone through proper review; in the worst case, they can come straight from code on a feature branch on a dev laptop.
Here's a typical flow for a producer using the Confluent Schema Registry (CSR) with schema auto-registration enabled (the serializer default):

- The producer is deployed with whatever schema happens to be compiled into its code.
- On first produce, the client's serializer registers that schema with the CSR under the topic's subject, at runtime, passing only the CSR's basic compatibility check.
- The serializer gets back a schema ID, prepends it to each record, and produces; consumers later fetch the schema by ID to deserialize.
This is a recipe for disaster. The CSR's checks for compatibility are basic, and don't take semantic properties into account. For Protobuf, the CSR doesn't check all properties that must be checked to ensure true Protobuf compatibility (a fact we'll dive into in a future post). Schemas can appear at runtime without any vetting.
Buf introduces a different world with the Buf Schema Registry (BSR). Schemas cannot appear out of thin air; they can only enter the registry at build time, via explicit pushes from source control, after passing stringent breaking change and policy checks. Buf checks not only basic properties, but semantic properties as well via Protovalidate. And Buf has the world's Protobuf experts: when we validate that your schemas have no breaking changes, we mean it. Schemas are code reviewed by the relevant teams, just like any other piece of code. The same flow, with the BSR:

- An engineer changes the schema on a branch and opens a pull request, just like any other code change.
- CI runs breaking change, lint, and policy checks against the schemas already in production; bad changes never merge.
- The relevant teams review and approve the change.
- Only after merge is the vetted schema pushed to the BSR, and only then can producers and consumers pick it up.
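In CI, those checks are typically just Buf CLI invocations; a minimal sketch, assuming the schemas live in the same repository and main is the branch of record:

# Enforce style and policy rules.
buf lint

# Fail the build if the change breaks the schemas currently on main.
buf breaking --against '.git#branch=main'

# After review and merge, publish the vetted schemas to the BSR.
buf push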
Without this proper schema governance, there can be no confidence in the underlying data traversing your systems, and consumers have to stay on their toes.
Buf brings a holistic approach to this problem. We're making it possible to use a single schema language across your entire stack with ease. Given the following Protobuf message:
message User {
  option (buf.kafka.v1.topic) = "user-created";
  option (buf.kafka.v1.topic) = "user-updated";
  option (buf.validate.message).cel = {
    expression: "!has(this.first_name) || has(this.last_name)"
  };

  string id = 1 [
    (buf.validate.field).string.uuid = true,
    (acme.option.v1.safe_for_ai) = true
  ];
  string handle = 2 [
    (buf.validate.field).string.min_len = 1,
    (buf.validate.field).string.max_len = 64,
    (acme.option.v1.safe_for_ai) = true
  ];
  string first_name = 3 [
    (buf.validate.field).string.min_len = 1,
    (buf.validate.field).string.max_len = 64,
    (buf.rbac.v1.field).role = "pii",
    (acme.option.v1.safe_for_ai) = false
  ];
  string last_name = 4 [
    (buf.validate.field).string.min_len = 1,
    (buf.validate.field).string.max_len = 64,
    (buf.rbac.v1.field).role = "pii",
    (acme.option.v1.safe_for_ai) = false
  ];
  string email = 5 [
    (buf.validate.field).required = true,
    (buf.validate.field).string.email = true,
    (buf.rbac.v1.field).role = "pii",
    (acme.option.v1.safe_for_ai) = false
  ];
  uint32 age = 6 [
    (buf.validate.field).uint32.lte = 150,
    (buf.rbac.v1.field).role = "pii",
    (acme.option.v1.safe_for_ai) = true
  ];
}
You should be able to:
- Edit User safely and easily in your IDE of choice, using Buf's tools to enforce that changes to User comply with your style guide and policies. For example, you may want to make sure that every field has a safe_for_ai annotation, noting whether or not it is safe to train AI models on that field.
- Guarantee that changes to User do not introduce any breaking changes or policy violations. Bad changes to User will be blocked at build-time, and never allowed to propagate to generated code, Kafka topics, or data lakes.
- Work with User in any language without needing to understand Protobuf or its toolchain.
- Block invalid Users from ever making it down your stack via Protovalidate. Your RPC framework should have interceptors at the application layer to enforce the properties of Users, and your Kafka-compatible message queue should either reject malformed Users via the Producer API or send them to a DLQ. No bad data should ever again enter your topics or data lake.
- Map Users produced to the user-created and user-updated Kafka topics into Iceberg tables in your data lake, to be queried within seconds of production, while paying only once for both Kafka and data lake storage. Consumers of your Iceberg tables can be confident that the data they consume will always be correct, and that the backing schema will never be broken.

And so much more. If this is a world that interests you, get in touch; we'd love to get to work.