
Apache Iceberg™ configuration

Bufstream supports two modes of Iceberg integration: Iceberg Archives and Iceberg Export. Both modes require a schema provider and at least one catalog configured in Bufstream.

Configure a schema provider

Start by making sure you’ve configured a schema provider: a Buf Schema Registry or Buf input.

In addition to configuring a schema provider, be sure to set topic configurations like buf.registry.value.schema.module and buf.registry.value.schema.message.

Add a catalog

Before configuring topics, add at least one catalog to your top-level Bufstream configuration in bufstream.yaml or, for Kubernetes deployments, your Helm values.yaml file. Assign each catalog a unique name. The same catalog can be used for both Archives and Export topics.

REST catalogs

To use a REST catalog with Bufstream, add a catalog with the rest key and the API’s URL.

The following examples show a minimal working configuration in bufstream.yaml and in Helm values.yaml, respectively:

bufstream.yaml:

iceberg:
  - name: example_rest
    rest:
      url: https://example-rest.my-domain.com

Helm values.yaml:

iceberg:
  catalogs:
    example_rest:
      use: rest
      rest:
        url: https://example-rest.my-domain.com

Bufstream’s reference documentation describes all REST catalog configuration options for both bufstream.yaml and Helm values.yaml, including TLS and authentication configuration options.

AWS Glue™ Data Catalog

To use an AWS Glue Data Catalog with Bufstream, add a catalog with the aws_glue_data_catalog (bufstream.yaml) or awsGlue (Helm values.yaml) key. AWS Glue Data Catalogs have no required properties.

The following examples show a minimal working configuration in bufstream.yaml and in Helm values.yaml, respectively:

bufstream.yaml:

iceberg:
  - name: example_glue
    aws_glue_data_catalog:
      # No further configuration is required. Account, region, and
      # credentials may be specified, but default to those associated with
      # the broker's host.

Helm values.yaml:

iceberg:
  catalogs:
    example_glue:
      use: awsGlue
      awsGlue:
        # No further configuration is required. Account, region, and
        # credentials may be specified, but default to those associated with
        # the broker's host.

Bufstream’s reference documentation describes all AWS Glue Data Catalog configuration options for both bufstream.yaml and Helm values.yaml, including account, region, and access key configuration.

Google BigQuery™ metastore

To use Google BigQuery metastore with Bufstream, add a catalog with the bigquery_metastore (bufstream.yaml) or bigQuery (Helm values.yaml) key. BigQuery metastores have no required properties.

The following examples show a minimal working configuration for BigQuery metastore in bufstream.yaml and in Helm values.yaml, respectively:

bufstream.yaml:

iceberg:
  - name: example_bigquery
    bigquery_metastore:
      # No further configuration is required. Project, location, and
      # cloud_resource_connection may be specified. Project is assumed to
      # be the project in which the Bufstream workload is running.

Helm values.yaml:

iceberg:
  catalogs:
    example_bigquery:
      use: bigQuery
      bigQuery:
        # No further configuration is required. Project, location, and
        # cloud_resource_connection may be specified. Project is assumed to
        # be the project in which the Bufstream workload is running.

Bufstream’s reference documentation describes all Google BigQuery metastore configuration options for both bufstream.yaml and Helm values.yaml.

Iceberg Archives (zero-copy)

Iceberg Archives use a zero-copy architecture: the same Parquet files that serve Kafka consumers also back the Iceberg table. Topic retention applies to the Iceberg table—when data expires from the topic, it is removed from the table too.

TL;DR

Configure a schema provider and a catalog, then set topic parameters:

Configure topic for Iceberg Archives
bufstream kafka config topic set --topic my-topic --name bufstream.archive.iceberg.catalog --value example_rest
bufstream kafka config topic set --topic my-topic --name bufstream.archive.iceberg.table --value bufstream.my_topic
bufstream kafka config topic set --topic my-topic --name bufstream.archive.kind --value ICEBERG

Topic data is retained for seven days by default. Configure retention.ms and/or retention.bytes to retain data longer. Set retention.ms to a negative value to keep all data indefinitely.
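For example, retention can be adjusted with the same CLI used above; the topic name and the 30-day value are illustrative:

```
# Retain topic and table data for 30 days:
bufstream kafka config topic set --topic my-topic --name retention.ms --value 2592000000

# Or keep all data in the Iceberg table indefinitely:
bufstream kafka config topic set --topic my-topic --name retention.ms --value -1
```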

Configure topics

To begin using Iceberg Archives, set the following required topic configuration parameters. It’s OK to update these after a topic has already been created.

  • bufstream.archive.kind: Must be set to ICEBERG.
  • bufstream.archive.iceberg.catalog: Must be the name of a catalog in your Bufstream configuration.
  • bufstream.archive.iceberg.table: Must be a valid namespace and table for the configured catalog in the form namespace.table.

    There must be at least one component in the namespace; a value of table is considered invalid since it includes no namespace component.

    Catalogs may impose other constraints on table names not imposed by Bufstream. For example, BigQuery doesn’t allow a table named table because it’s a reserved word.

You can also set the following optional properties:

  • retention.ms and/or retention.bytes: Standard Kafka retention properties that control how long to keep topic data, optionally limiting it based on size instead of (or in addition to) age.

    The retention policy applies to both the Kafka topic and the Iceberg table: when data in the topic expires, Bufstream removes it from the Iceberg table before deleting it from object storage.

    The default retention policy is no size limit with a seven-day age limit. You can override this at the cluster level or per topic. To retain data in the Iceberg table longer than seven days, you must configure these retention settings. To never expire data and keep all history in the table, set retention.ms to a negative value.

  • bufstream.archive.parquet.granularity: The granularity used to partition Parquet data files. Also used to define the partition scheme of the Iceberg table.

    Valid options are MONTHLY, DAILY (default), or HOURLY.
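For instance, the partition granularity could be changed to hourly with the same CLI used for the required properties (the topic name is illustrative):

```
bufstream kafka config topic set --topic my-topic --name bufstream.archive.parquet.granularity --value HOURLY
```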

Iceberg Export (continuous export)

Iceberg Export copies data from a Kafka topic into a separate Iceberg table on a configurable schedule. The exported table is independent of Kafka topic retention and is compatible with managed Iceberg catalogs such as Databricks Managed Tables.

Iceberg Export requires Bufstream 0.4.5 or newer.

TL;DR

Configure a schema provider and a catalog, then set topic parameters:

Configure topic for Iceberg Export
bufstream kafka config topic set --topic my-topic --name bufstream.export.iceberg.catalog --value example_rest
bufstream kafka config topic set --topic my-topic --name bufstream.export.iceberg.table --value bufstream.my_topic
bufstream kafka config topic set --topic my-topic --name bufstream.export.iceberg.commit.freq.ms --value 300000

After the commit frequency elapses, new topic data appears in the Iceberg table.

Configure topics

To begin using Iceberg Export, set the following required topic configuration parameters. It’s OK to update these on an existing topic.

  • bufstream.export.iceberg.catalog: Must be the name of a catalog in your Bufstream configuration.
  • bufstream.export.iceberg.table: Must be a valid namespace and table for the configured catalog in the form namespace.table.

    There must be at least one component in the namespace; a value of table is considered invalid since it includes no namespace component.

  • bufstream.export.iceberg.commit.freq.ms: How often data is flushed to a new snapshot.

You can also set the following optional properties:

  • bufstream.export.iceberg.granularity: The granularity to use for partitioning the table by date/time. If omitted, the table won’t be partitioned by date/time.

    Valid options are MONTHLY, DAILY, HOURLY, or no value (default).

  • bufstream.export.iceberg.use.ingest.time: Whether to use the ingestion timestamp of the record for date/time partitioning. If false, the record timestamp is used.

    Valid options are TRUE, FALSE (default).

  • bufstream.export.iceberg.partition.fields: Additional fields to use for partitioning the table. See partitioning tables for details and examples.
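As an illustration, the optional export properties above could be set with the same CLI as the required ones (the topic name and values are illustrative):

```
bufstream kafka config topic set --topic my-topic --name bufstream.export.iceberg.granularity --value DAILY
bufstream kafka config topic set --topic my-topic --name bufstream.export.iceberg.use.ingest.time --value TRUE
```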

Partitioning tables

Bufstream allows you to list additional fields used to partition the table. Each element in the list is the path to a field in the record value that should be used for partitioning. All values referenced must be leaf (scalar) fields.

You can optionally use the prefix val: to indicate fields in the record value. If the prefix is not present, val: is assumed.

Use * to partition by the raw bytes of a field. Use key:* to partition by the raw bytes of the record key.

Each element can have an optional /N suffix, where N is the number of buckets. When this is used, the field value isn't used directly as the partition field; instead, the value is hashed and assigned a bucket between 0 and N-1 (inclusive). This is particularly useful for partitioning by high-cardinality fields without creating many small partitions, which would reduce performance for queries that don't filter on that field.
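To build intuition for the /N suffix, the following Python sketch shows the general hash-and-bucket idea. It's illustrative only: Bufstream and Iceberg define their own hash function, and this sketch substitutes MD5 as a deterministic stand-in, so the bucket numbers here won't match real table partitions.

```python
import hashlib

def bucket_for(value: str, num_buckets: int) -> int:
    """Assign a value to one of num_buckets buckets by hashing it.

    MD5 is a stand-in hash for illustration; real partitioning uses
    Iceberg's own hash function, so actual bucket assignments differ.
    """
    digest = hashlib.md5(value.encode("utf-8")).digest()
    # Interpret the first 4 bytes as an unsigned integer, then take
    # the remainder to land in [0, num_buckets - 1].
    return int.from_bytes(digest[:4], "big") % num_buckets

# Many distinct zip codes collapse into at most 32 partitions,
# mirroring the address.zip_code/32 example in this section.
zip_buckets = {bucket_for(z, 32) for z in ("02134", "10001", "94103", "60601")}
```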

Examples

  • key:*/8

    Partition by the record’s raw key bytes, hashing into eight buckets.

  • val:sales.ae.id

Partition by a field in the record value. The record value is a message with a field named sales; that sales message has a field named ae, which in turn has a field named id. The table will have a partition for each distinct id value.

  • address.zip_code/32

Partition by a field in the record value. As above, the value is a message with a field named address, which is a message with a field named zip_code. (Because there's no prefix, val: is assumed.) The zip_code value is hashed into one of 32 buckets, so the table will have at most 32 partitions for zip codes.
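As a sketch, a single partition field like the last example could be applied with the topic-configuration CLI (the topic name is illustrative, and only a single field is shown here):

```
bufstream kafka config topic set --topic my-topic --name bufstream.export.iceberg.partition.fields --value address.zip_code/32
```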

Topic configuration tools

Bufstream supports reading and updating topic configuration values from any Kafka API-compatible tool, including browser-based interfaces like AKHQ and Redpanda Console.
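For example, the same topic property could be set with the stock CLI that ships with Apache Kafka (the broker address is illustrative):

```
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name my-topic \
  --alter --add-config bufstream.archive.kind=ICEBERG
```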

Architecture and reference

For more information about Iceberg Archives, including intake and archive considerations, table schema, and performance considerations, see the Iceberg Archives reference. For more information about Iceberg Export, including export data flow, table schema, and behavior differences, see the Iceberg Export reference.