Bufstream quickstart#
What is Bufstream?#
Bufstream is a fully self-hosted drop-in replacement for Apache Kafka® that writes data to S3-compatible object storage. It's 100% compatible with the Kafka protocol, including support for exactly-once semantics (EOS) and transactions. Bufstream is 8x cheaper to operate than Apache Kafka, and a single cluster can elastically scale to hundreds of GB/s of throughput. It's the universal Kafka replacement for the modern age.
But Bufstream solves problems that Kafka can't. With broker-side semantic validation, Protovalidate rules guarantee bad data never reaches your consumers—the broker blocks it before it corrupts your downstream systems. And with direct Apache Iceberg™ integration, your Kafka topics are simultaneously queryable data lakehouse tables. No pipelines. No data errors. Produce a message, run a SQL query. That's it.
In this quickstart, you'll learn how to run Bufstream locally, block invalid data with semantic validation, and query streaming data as Iceberg tables.
Run the broker#
Bufstream brokers are simple binaries. Let's get one running locally!
Download the Bufstream broker:
# Bufstream brokers are only available for Mac and Linux!
curl -sSL -o bufstream \
"https://buf.build/dl/bufstream/latest/bufstream-$(uname -s)-$(uname -m)" && \
chmod +x bufstream
Run the Bufstream broker in local mode:
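The exact invocation isn't reproduced here; a minimal sketch, assuming the serve subcommand and that local mode needs no extra configuration (check bufstream --help if your version differs):
# Start the broker. Depending on your Bufstream version, a flag or small config
# file may be needed to select local (in-memory or on-disk) storage.
./bufstream serve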
Log lines similar to this should print out (time and level fields stripped):
Congratulations! Bufstream is running locally at localhost:9092.
If you normally use Kafka tools like AKHQ, Redpanda Console™, or kafkactl, feel free to use any of them to connect, create topics, and test producing or consuming messages.
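For example, with kafkactl already configured to point at localhost:9092, a quick smoke test might look like this (the topic name here is arbitrary):
# Create a test topic, produce one record, then read it back from the beginning.
kafkactl create topic quickstart-test --partitions 1
kafkactl produce quickstart-test --value 'hello bufstream'
kafkactl consume quickstart-test --from-beginning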
That's all it takes to run Bufstream locally as a drop-in Kafka replacement. But Bufstream's built for more than cost savings and compatibility. Let's see what makes it different.
🛑 Stop the broker (ctrl-c) before continuing.
Kafka's data problem#
Streaming data has a major problem: it provides no data quality guarantees. In traditional Kafka, brokers are simple data pipes with no understanding of what data traverses them. This simplicity helped Kafka gain ubiquity, but it created a data quality problem.
Most data sent through Kafka topics has a schema—a shape—but validation is precariously left to clients. Client-side enforcement is "opt-in" enforcement. Producers can choose to do it or not, meaning you have no guarantees about the quality of data reaching your consumers. We'd never accept this for network APIs: imagine if your application servers relied on web clients to validate their data and persisted whatever they were given. We'd all be in trouble!
Let's set up a simple example, where we'll consume a stream of thousands of e-commerce shopping carts, some of which have problematic data.
You'll need Go installed. If you're on a Mac and using Homebrew, this is as easy as:
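For example, a typical Homebrew install is:
brew install go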
We'll run all the examples here in the context of the demo repository. Stop the Bufstream broker and clone the repository:
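Assuming the demo lives in the public bufbuild/bufstream-demo repository on GitHub:
git clone https://github.com/bufbuild/bufstream-demo.git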
Move the broker into bufstream-demo then enter that directory:
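For example:
mv bufstream bufstream-demo/
cd bufstream-demo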
Open two more terminals in bufstream-demo, for a total of three:
- One for the broker.
- A second, where you'll launch producers.
- A third, where you'll launch consumers.
In the three terminals open to the bufstream-demo directory, start the broker, the producer, and the consumer. The consumer command is:
go run ./cmd/bufstream-demo-consume \
--topic orders \
--group order-verifier
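The broker and producer commands for the other two terminals aren't reproduced above. Based on the --schema flag referenced in the next section and the producer invocation reused later in this quickstart, they look roughly like the following (a sketch; the serve subcommand and exact flags may differ in your version, so confirm against the demo's README):
# Terminal 1: the broker, with the repository's Protobuf schemas made available via --schema.
./bufstream serve --schema .

# Terminal 2: the demo producer (the same command is used again later).
go run ./cmd/bufstream-demo-produce \
  --topic orders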
From the consumer, you should see something similar to the following (time and level fields stripped):
msg="starting consume"
msg="received a Cart with a zero-quantity LineItem"
msg="received 250 carts"
msg="received a Cart with a zero-quantity LineItem"
About 1% of the messages produced are invalid: they contain carts whose line items include a quantity of zero, violating the semantic validation rules in the schema and leaving a division-by-zero error waiting to happen!
But this is how traditional Kafka works: the broker is just a data pipe with no understanding of what traverses it. Bad data flows right through to your consumers, a recipe for data quality disasters downstream.
🛑 Stop the producer and consumer (ctrl-c) before continuing.
Schemas and semantic validation#
Bufstream uses two concepts to guarantee data quality:
- Broker-side schema awareness: Bufstream understands the shape of the data traversing its topics. When configured with a schema provider (which we just did with --schema .), Bufstream can access Protobuf schemas. When you map a message type to a topic, the broker knows what to expect.
- Semantic validation: Fields in a schema are more than just data types. They have semantic properties: they can be optional or required, email addresses must match a specific format, and numbers like quantities must be positive. Bufstream's governed topics enable semantic validation via Protovalidate. When producers send data, the broker validates it. If a record doesn't pass validation, you can choose to reject the entire batch or send the offending record to a dead-letter topic.
This is a fundamental shift from traditional Kafka, where schema validation is left to clients (an "opt-in" enforcement model where producers can choose to do it or not). Bufstream's broker-side semantic validation guarantees topic data meets your business rules and domain logic, eliminating an entire class of data errors before they materialize in downstream systems.
Map messages to topics#
Remember when we started Bufstream with --schema .? That made Protobuf schemas in the example repository, like Cart, available to the broker. Notice its validation rules: line_items must have at least one entry, and every LineItem's quantity must be greater than zero. These are semantic properties that go beyond just type checking, representing real business logic:
// Cart is a collection of goods or services sold to a customer.
message Cart {
  // line_items represent individual items on this cart. A valid Cart must have
  // at least one and no more than 1,000 items.
  repeated LineItem line_items = 4 [
    (buf.validate.field).repeated.min_items = 1,
    (buf.validate.field).repeated.max_items = 1000
  ];
}

// LineItem is an individual good or service added to a cart.
message LineItem {
  // quantity is the unit count of the good or service provided.
  uint64 quantity = 3 [
    // Do not allow zero-quantity LineItems.
    (buf.validate.field).uint64.gt = 0
  ];
}
Bufstream can use Protobuf schemas from a variety of sources:
- Buf Schema Registry modules
- Local Protobuf files in a Buf workspace
- Any Buf input, like a FileDescriptorSet compiled by protoc (see the example after this list)
- A Confluent Schema Registry
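For example, a FileDescriptorSet for the third option can be produced with either protoc or buf (the .proto paths and output name here are just placeholders):
# Compile local .proto files into a FileDescriptorSet, including imported files.
protoc --include_imports --descriptor_set_out=image.binpb proto/*.proto

# Or build a Buf image (a superset of a FileDescriptorSet) from a Buf workspace.
buf build -o image.binpb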
Let's tell Bufstream which message type the orders topic should contain. Set the buf.registry.value.schema.message topic property to Cart's fully qualified name:
./bufstream kafka config topic set \
--topic orders \
--name buf.registry.value.schema.message \
--value bufstream.demo.v1.Cart
Reject invalid messages#
Now that orders is schema aware, configure it to reject batches containing invalid messages:
./bufstream kafka config topic set \
--topic orders \
--name bufstream.validate.mode \
--value reject
Restart the producer and consumer:
go run ./cmd/bufstream-demo-produce \
--topic orders
go run ./cmd/bufstream-demo-consume \
--topic orders \
--group order-verifier
Once the consumer reaches the newest offsets, you'll see that there are no more invalid cart messages! Check the producer and broker terminals: you'll see error messages about validation failures. The broker is now blocking bad data before it reaches your consumers.
The reject mode is great for greenfield development where you can fix producers immediately, but it might be too disruptive for existing systems.
Use a dead-letter queue#
A better approach for many scenarios is to use a dead-letter queue (DLQ), where invalid messages are routed to a separate topic for inspection and reprocessing. Stop the producer, switch bufstream.validate.mode to dlq, then restart the producer:
./bufstream kafka config topic set \
--topic orders \
--name bufstream.validate.mode \
--value dlq
Check all your terminals. No producer or consumer errors! In the broker terminal, you can see Bufstream logging validation failures as it stores invalid messages in a DLQ topic (orders.dlq).
If you want to see the DLQ in action, start another terminal and consume it:
go run ./cmd/bufstream-demo-consume-dlq \
--topic orders.dlq
This is broker-side semantic validation at work: guaranteed data quality with zero changes to your applications. Unlike traditional Kafka where validation is left to clients (if they choose to do it at all), Bufstream ensures only valid data reaches your consumers. Always.
Clean up#
🛑 Use ctrl-c to stop the broker, producer, consumer, and the optional DLQ consumer before continuing.
Now that we've guaranteed orders contains quality records, let's see how to analyze them with Iceberg integration.
Iceberg integration#
Traditional streaming architectures require layers of data pipelines between your message queue and your data lakehouse. To see who's buying what in our e-commerce system, you'd need ETL jobs, schema mapping, data quality checks, and orchestration—all adding latency, complexity, and cost.
Bufstream eliminates this entirely. Because it understands your schemas, Bufstream can store topic data as Apache Iceberg™ tables. Zero copies, no pipelines. Your Kafka topics are also queryable tables in your data lakehouse.
Run the Compose project#
In production, Bufstream needs an object store (S3, GCS, or Azure Blob Storage) and a metadata store (Postgres or Cloud Spanner). The repository's Docker Compose project at iceberg/docker-compose.yaml provides this with MinIO and Postgres, adding an Iceberg REST catalog and Apache Spark with Jupyter Notebooks.
Make sure you've stopped your Bufstream broker, producer, and consumer. In your broker terminal, start the iceberg/docker-compose.yaml Compose project. This requires large downloads and may take a few minutes.
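From the repository root, that looks like the following (this assumes the Compose services run in the foreground, since you'll stop them later with ctrl-c):
# Start MinIO, Postgres, the Iceberg REST catalog, Spark with Jupyter, and Bufstream.
docker compose -f iceberg/docker-compose.yaml up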
Wait until you see output like the following before continuing:
create-orders-topic-1 | config has been altered
create-orders-topic-1 exited with code 0
This new, empty orders topic has been configured for DLQ mode.
Additionally, the three necessary topic configuration values for Iceberg have been set for you:
- bufstream.archive.kind is set to ICEBERG.
- bufstream.archive.iceberg.catalog is set to local-rest-catalog, the name of a catalog in iceberg/bufstream.yaml.
- bufstream.archive.iceberg.table is set to bufstream.orders.
Create sample data#
Run the producer to create sample data:
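This is the same producer command used earlier, assuming the Compose broker is reachable at localhost:9092:
go run ./cmd/bufstream-demo-produce \
  --topic orders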
Query Iceberg tables#
Iceberg isn't a database. It's a table format for analytical databases. To query your tables, you can use any Iceberg-compatible query engine like Apache Spark, Google's BigQuery, AWS Athena, ClickHouse, or Trino. For this quickstart, we'll use a local Apache Spark instance with Jupyter Notebooks.
In production, Bufstream periodically updates your Iceberg catalog. For local development, you can run the update manually using Bufstream's admin clean topics command. Stop the producer then run the clean command:
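A sketch of the invocation; the subcommand path comes from the prose above, but your version may require additional flags (such as the topic name or the broker's admin address), so check the command's --help output:
# Trigger cleaning/archiving so the Iceberg catalog metadata is brought up to date.
./bufstream admin clean topics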
You should see progress messages like the following. Once the job completes, your order data is ready for analytics:
╭─ Job Result: clean-topics
├─ ✓ clean-topics [#########################] 100% (4/4)
╰─ ✓ resolve-partitions [#########################] 100% (4/4)
Open the provided Jupyter Notebook at http://localhost:8888/notebooks/notebooks/bufstream-quickstart.ipynb. Find the query cell starting with this SQL and click in it, then either click the ▶︎ icon or use shift-return to execute it:
%%sql
WITH categories AS (
  SELECT
    line_item.product.category.name AS category,
    SUM(line_item.quantity * line_item.unit_price_cents) / 100.0 AS revenue,
    COUNT(DISTINCT val.cart_id) AS carts,
    SUM(line_item.quantity) AS units_sold
  FROM `bufstream`.`orders`
  LATERAL VIEW EXPLODE(val.line_items) AS line_item
  GROUP BY category
)
-- Complete SQL omitted for brevity.
The results show aggregated sales data by category:
| category | revenue | carts | units_sold |
|---|---|---|---|
| Books & Stationery | 51,325,913.90 | 433,509 | 1,902,010 |
| Electronics & Accessories | 66,595,794.45 | 478,268 | 2,283,255 |
| Home & Garden | 74,389,867.55 | 478,944 | 2,288,845 |
| Kitchen & Dining | 71,785,506.09 | 433,954 | 1,909,491 |
| Personal Care | 35,022,047.75 | 378,871 | 1,523,025 |
| Sports & Outdoors | 88,873,545.59 | 478,196 | 2,281,141 |
| TOTAL | 387,992,675.33 | 738,859 | 12,187,767 |
Your exact numbers will vary—the data's random—but you're now querying real streaming data from Bufstream topics stored as Iceberg tables with guaranteed quality and zero data pipelines.
Clean up#
Use ctrl-c to stop the Compose project, then remove all containers with docker compose down.
All of their data is in the iceberg/data directory and can be deleted.
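Since the Compose file lives under iceberg/, point docker compose at it explicitly, then remove the data directory if you're done with it:
docker compose -f iceberg/docker-compose.yaml down
rm -rf iceberg/data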
Wrapping up#
This is only scratching the surface. Explore the Bufstream docs to learn more, and if operational simplicity, valid data, and saving a ton of money is interesting to you, get in touch.