# Bufstream Apache Iceberg integration
Bufstream streams data from topics to Apache Iceberg™ tables, eliminating the need for complex and costly ETL pipelines. Bufstream offers two distinct modes of Iceberg integration, each with different tradeoffs:
## Iceberg Archives (zero-copy)
Bufstream’s Iceberg Archives mode uses a zero-copy architecture: as data flows through the broker, Bufstream transforms it to Parquet format and layers Iceberg metadata on top. The Parquet files stored in object storage serve as a single source of truth—used both to serve Kafka consumers and as data files in the Iceberg table. No separate copy of the data is made for analytics.
This mode is ideal when:
- You want to query your Kafka data using analytics tools like Apache Spark™, Amazon Athena™, Dremio®, Trino™, or Starburst™.
- You want object storage to be the single source of truth, without duplicating data.
- You want Iceberg table retention to be automatically tied to your Kafka topic retention policy.
## Iceberg Export (continuous export)
Bufstream’s Iceberg Export mode copies data from a Kafka topic into a separate Iceberg table on a configurable schedule. Unlike the zero-copy Archives mode, Export is “fire and forget”: Bufstream continuously pushes data into the table independently of Kafka topic retention, making your data available in analytics systems that manage their own Iceberg tables.
This mode is ideal when:
- You want to push data into a managed Iceberg table, such as a Databricks Managed Table.
- You want your Iceberg table to persist independently of Kafka topic retention.
- You need to partition the table by message fields in addition to time.
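Conceptually, Export is a scheduled copy loop: each tick moves any records appended since the previous export into the external table, and that table keeps growing even after the topic's retention expires old records. A minimal pure-Python sketch of the idea (all names invented; this is not Bufstream's implementation):

```python
# Conceptual sketch of scheduled export; not Bufstream's implementation.
# Offsets are simplified to list indices for brevity.
topic = []            # stand-in for a Kafka topic log
iceberg_table = []    # stand-in for the external Iceberg table
last_exported = 0     # high-water mark tracked by the export job

def export_tick():
    """Copy records appended since the previous tick into the table."""
    global last_exported
    iceberg_table.extend(topic[last_exported:])
    last_exported = len(topic)

topic.extend(["r0", "r1"])
export_tick()
topic.extend(["r2"])
export_tick()
print(iceberg_table)  # ['r0', 'r1', 'r2']
```

In real deployments the schedule, batching, and offset tracking are handled by the broker; the point is only that the exported table's lifetime is decoupled from the topic's.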
## Supported catalogs
Both modes integrate with your existing data workflows through native support for:
- REST Catalogs (including Apache Polaris™)
- AWS Glue™ Data Catalog
- Google BigQuery™ Metastore
Bufstream’s REST catalog support allows you to deploy a REST adapter in front of Iceberg catalogs that don’t have direct integration, enabling compatibility with any existing catalog as well as bespoke implementations.
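For a sense of what the consuming side looks like, here is a hedged PyIceberg sketch; the catalog URI, warehouse, and table name are placeholders of my own, not values from Bufstream's configuration.

```python
# Hypothetical sketch: the properties an analytics client (PyIceberg) uses
# to reach a REST catalog. URI, warehouse, and table names are placeholders.
def rest_catalog_properties(uri: str, warehouse: str) -> dict:
    """Assemble standard PyIceberg properties for a REST catalog."""
    return {"type": "rest", "uri": uri, "warehouse": warehouse}

props = rest_catalog_properties("http://localhost:8181", "s3://my-warehouse")

# With a running catalog and pyiceberg installed, the connection looks like:
#   from pyiceberg.catalog import load_catalog
#   catalog = load_catalog("bufstream", **props)
#   table = catalog.load_table("kafka.my_topic")
print(props["type"])  # rest
```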
## Why Iceberg?
The existing processes for getting data out of Kafka and into a data lake, warehouse, or analytics tool are prone to human error, introduce duplicative storage, and increase operational expenses. Bufstream eliminates these complex and brittle processes, delegates data transformation to the broker, and makes object storage the single source of truth for ready-to-query data—leading your team to critical insights faster.
Over the last few years, Iceberg has become the leading standard for storing large data sets in the data lakehouse, and its ecosystem has grown to unite data platform teams and analytics teams with tools like Apache Spark, Amazon Athena, Dremio, Trino, and Starburst. These conditions made Iceberg a great fit for Bufstream.
Today, to shift data out of Kafka and into a data warehouse, teams must do some or all of the following:
- Set up a consumption workflow that requires additional compute and storage, utilizing Kafka Connect or bespoke direct-to-data-lakehouse engines.
- Create and maintain a complex pipeline of operations that transform the data to a columnar format (like Parquet), materialize the data, and address any schema changes or evolution.
- Guard against degraded performance by manually cleaning up the small files that pile up in object storage as a result of continuous transformations of streaming data.
As a result, teams spend twice the time and expense to use the same data in a downstream system.
## What does Bufstream do differently?
Bufstream shifts all of the work to materialize, transform, and validate data into the streaming pipeline itself, reducing the maintenance, cost, and operational burden for data platform and analytics teams. Bufstream’s brokers are schema-aware, semantically intelligent, and able to transform data in transit, so a single tool and process streams your data and readies it for the lakehouse and analytics engines.
Bufstream’s broker-side semantic validation ensures that data entering your lakehouse or used by query engines conforms to a known schema from your schema registry and meets validation requirements for each field, keeping malformed or invalid data out of your lakehouse. Once the data passes these quality checks, Bufstream transforms it into Parquet and materializes the Iceberg metadata from the approved schema, eliminating manual transformation tools that need routine maintenance for every change to application data. As a result, lakehouse-compatible data rests in object storage as a source of truth, without transforming, materializing, and persisting a new copy of the raw data just for analytics use cases.
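The validate-then-transform flow can be sketched in plain Python (the schema and the field rule below are invented; Bufstream applies this broker-side against schemas from your schema registry):

```python
# Conceptual sketch of broker-side validation; schema and rules are invented.
REQUIRED_FIELDS = {"user_id": int, "email": str}

def validate(record: dict) -> bool:
    """Accept only records matching the known schema and field rules."""
    for field, field_type in REQUIRED_FIELDS.items():
        if not isinstance(record.get(field), field_type):
            return False
    # A semantic rule beyond structural typing:
    return "@" in record["email"]

incoming = [
    {"user_id": 1, "email": "a@example.com"},    # valid
    {"user_id": "2", "email": "b@example.com"},  # wrong type: rejected
    {"user_id": 3, "email": "not-an-email"},     # fails field rule: rejected
]

# Only validated records continue on to Parquet and Iceberg metadata.
accepted = [r for r in incoming if validate(r)]
print(len(accepted))  # 1
```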
## What’s next?
- Learn how to configure Iceberg Archives or Iceberg Export.
- Take a deep dive into the Iceberg Archives reference or the Iceberg Export reference.
- Configure Databricks Managed Iceberg tables.