Apache Iceberg™ configuration#
For Bufstream to store data as Iceberg tables, you'll need to update its configuration and configure topics.
TL;DR#
Start by configuring a schema provider and adding an Iceberg catalog to Bufstream's configuration. Update your configuration, restart Bufstream, then set topic configuration parameters:
bufstream kafka config topic set --topic my-topic --name bufstream.archive.iceberg.catalog --value example_rest
bufstream kafka config topic set --topic my-topic --name bufstream.archive.iceberg.table --value bufstream.my_topic
bufstream kafka config topic set --topic my-topic --name bufstream.archive.kind --value ICEBERG
By default, topic data is retained for seven days. Configure retention.ms and/or retention.bytes to retain data longer. Set retention.ms to a negative value to keep all data indefinitely.
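As a sketch reusing the CLI form shown above (the topic name is a placeholder; the flag names are taken from the commands in this TL;DR), indefinite retention might be configured like this:

```sh
# Keep all topic data (and Iceberg table history) indefinitely
bufstream kafka config topic set --topic my-topic --name retention.ms --value -1
```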
Overview#
Configuring Bufstream's Iceberg integration typically takes three steps:
- Configure a schema provider. Schema providers allow Bufstream to generate and maintain Iceberg table schemas that match your Protobuf message definitions.
- Add a catalog to Bufstream's configuration.
- Set topic configuration parameters for catalog, table name, and archive format.
Once you've set these options, Bufstream begins storing topic data as Iceberg tables while maintaining the corresponding catalog.
Configure a schema provider#
Start by making sure you've configured a schema provider: a Buf Schema Registry, your local development environment, or a Confluent Schema Registry API.
In addition to configuring the provider itself, make sure you've set topic configurations like buf.registry.value.schema.module and buf.registry.value.schema.message.
Add a catalog#
Before configuring topics, add at least one catalog to your top-level Bufstream configuration in bufstream.yaml or, for Kubernetes deployments, your Helm values.yaml file. Assign each catalog a unique name.
REST catalogs#
To use a REST catalog with Bufstream, add a catalog with the rest key and the API's URL.
The following example is a minimal working configuration:
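As a hypothetical sketch of such a configuration (the key names and URL below are assumptions, not verified against Bufstream's reference documentation, which defines the exact schema):

```yaml
# Hypothetical sketch: key names and URL are assumptions; consult
# Bufstream's reference documentation for the exact configuration schema.
iceberg:
  catalogs:
    - name: example_rest    # unique catalog name, referenced by topic config
      rest:
        url: http://iceberg-rest.example.com:8181    # REST catalog API endpoint
```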
Bufstream's reference documentation describes all REST catalog configuration options for both bufstream.yaml and Helm values.yaml, including TLS and authentication configuration options.
AWS Glue™ Data Catalog#
To use an AWS Glue Data Catalog with Bufstream, add a catalog with the aws_glue_data_catalog (bufstream.yaml) or awsGlue (Helm values.yaml) key. AWS Glue Data Catalogs have no required properties.
The following example is a minimal working configuration:
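As a hypothetical sketch of such a configuration (key names are assumptions, not verified against Bufstream's reference documentation; since AWS Glue Data Catalogs have no required properties, the catalog entry can be empty):

```yaml
# Hypothetical sketch: key names are assumptions; consult
# Bufstream's reference documentation for the exact configuration schema.
iceberg:
  catalogs:
    - name: example_glue          # unique catalog name, referenced by topic config
      aws_glue_data_catalog: {}   # no required properties; defaults apply
```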
Bufstream's reference documentation describes all AWS Glue Data Catalog configuration options for both bufstream.yaml and Helm values.yaml, including account, region, and access key configuration.
Google BigQuery™ metastore#
To use Google BigQuery metastore with Bufstream, add a catalog with the bigquery_metastore (bufstream.yaml) or bigQuery (Helm values.yaml) key. BigQuery metastores have no required properties.
The example below shows how to configure Bufstream to use BigQuery metastore, including recommended settings that help avoid interruptions when archiving or reading data.
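As a hypothetical minimal sketch (key names are assumptions, not verified against Bufstream's reference documentation, and this omits the recommended settings described there; since BigQuery metastores have no required properties, the catalog entry can be empty):

```yaml
# Hypothetical minimal sketch: key names are assumptions, and the
# recommended settings are omitted; consult Bufstream's reference
# documentation for the exact configuration schema.
iceberg:
  catalogs:
    - name: example_bigquery   # unique catalog name, referenced by topic config
      bigquery_metastore: {}   # no required properties; defaults apply
```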
Bufstream's reference documentation describes all Google BigQuery metastore configuration options for both bufstream.yaml and Helm values.yaml.
Configure topics#
To begin using Iceberg storage, set the following topic configuration parameters. It's OK to update these after a topic has already been created.
- bufstream.archive.kind: This must be set to ICEBERG.
- bufstream.archive.iceberg.catalog: This must be the name of a catalog in your Bufstream configuration.
- bufstream.archive.iceberg.table: This must be a valid namespace and table for the configured catalog, in the form namespace.table. There must be at least one component in the namespace; a value of table alone is invalid since it includes no namespace component. Catalogs may impose other constraints on table names not imposed by Bufstream. For example, BigQuery doesn't allow a table named table because it's a reserved word.
In addition to the required properties, you can also set the following optional properties:
- retention.ms and/or retention.bytes: These are the standard Kafka retention properties that control how long to keep topic data, optionally limiting it based on size instead of (or in addition to) age. The retention policy applies to both the Kafka topic and the Iceberg table: when data in the topic expires, Bufstream removes it from the Iceberg table before deleting it from object storage. The default retention policy is no size limit with a seven-day age limit. You can override this at the cluster level or per topic. To retain data in the Iceberg table longer than seven days, you must configure these retention settings. To never expire data and keep all history in the table, set retention.ms to a negative value.
- bufstream.archive.parquet.granularity: The granularity used to partition Parquet data files, which also defines the partition scheme of the Iceberg table. Valid options are MONTHLY, DAILY (default), or HOURLY.
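These optional properties are set the same way as the required ones. As a sketch reusing the CLI form shown in the TL;DR (the topic name is a placeholder), hourly Parquet granularity might be configured like this:

```sh
# Partition Parquet data files (and the Iceberg table) by hour
bufstream kafka config topic set --topic my-topic --name bufstream.archive.parquet.granularity --value HOURLY
```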
Bufstream supports reading and updating topic configuration values from any Kafka API-compatible tool, including browser-based interfaces like AKHQ and Redpanda Console.
Architecture and reference#
For more information about Iceberg integration, including intake and archive considerations, table schema, and performance considerations, see the Iceberg reference.