Configuration
For Bufstream to store data as Iceberg tables, you must update the top-level Bufstream configuration as well as topic-level configuration to instruct Bufstream to update a corresponding Iceberg catalog. Once you have set these options, Bufstream begins storing data as Parquet file archives for long-term storage. Each archive contains data for only a single topic partition and is further partitioned by date, and Iceberg tables contain only committed messages.
Bufstream configuration options
To store Iceberg tables and interact with your preferred catalog, Bufstream needs configuration information about the specific catalog you wish to update. The example below illustrates how to configure Bufstream to use BigQuery Metastore.
While BigQuery Metastore does not have any required properties, we recommend setting the following values to ensure that there are no interruptions to archiving or reading data:
iceberg_integration:
  catalogs:
    - name: gcp
      bigquery_metastore:
        # The name of the GCP project that the Bufstream workload is running in.
        project: my-project
        # This field is necessary if you would like Bufstream to auto-create data sets.
        # Bufstream must know the location of BigQuery data sets
        # and have the appropriate permissions to create them.
        location: US
        # This field is not required. However, when specified, the BigQuery Cloud Resource
        # connection's GCP service account is used to read data from Bufstream's GCS bucket.
        # BigQuery data sets can only use connections in the same GCP project and location.
        cloud_resource_connection: my-connection
For additional information about config values and requirements, consult the Bufstream reference documentation.
Topic-level configuration
If a topic uses Protobuf as its message data format, Bufstream ensures that the Iceberg schema mirrors the Protobuf message schema. For other message formats (e.g. Avro, JSON), Bufstream represents the message keys and values as binary columns, which are opaque to human readers. Bufstream's direct-to-Iceberg integration also respects any data enforcement configuration set in the cluster configuration. If topics use the pass-through data enforcement setting, Bufstream stores the non-compliant message data in an __unrecognized__ binary field with encoded bytes. Data in the __unrecognized__ field may include invalid data, fields that have since been deleted from the schema stored in the schema registry, or new fields that are not yet in sync with the schema registry.
At this time, Bufstream does not support creating fields in the Iceberg schema for Protobuf extensions. If schemas contain extensions, the extensions also populate the __unrecognized__ field. Additionally, Bufstream does not currently support using Iceberg as the archive format for compacted topics.
To enable Iceberg storage for topics, you must explicitly set the archive format, catalog name, and table name when creating a new topic (see the sketch after this list). The following constraints apply to topic configuration:
- bufstream.archive.kind must be set to ICEBERG.
- bufstream.archive.iceberg.catalog must have the same name as the catalog configured in bufstream.yaml.
- bufstream.archive.iceberg.table must be a valid namespace and table for the configured catalog. This value must be in the form namespace.table. There must be at least one component in the namespace, so a value of table alone is invalid because it includes no namespace component. Backend systems may impose other constraints on table names not imposed by Bufstream (e.g. BigQuery does not allow a table named "table" because it is a reserved word in its query syntax).
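Because Bufstream is Kafka-compatible, you can set these values with any Kafka client or tool that supports topic configs. Below is a minimal sketch using the confluent-kafka Python AdminClient; the bootstrap address, topic name (orders), partition count, and table name (bufstream.orders) are placeholder assumptions, and the catalog name matches the gcp catalog from the earlier example.

from confluent_kafka.admin import AdminClient, NewTopic

# Assumed local Bufstream bootstrap address; adjust for your deployment.
admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# Hypothetical "orders" topic archived via the "gcp" catalog configured
# in bufstream.yaml, written to the namespace.table value "bufstream.orders".
topic = NewTopic(
    "orders",
    num_partitions=3,
    replication_factor=1,
    config={
        "bufstream.archive.kind": "ICEBERG",
        "bufstream.archive.iceberg.catalog": "gcp",
        "bufstream.archive.iceberg.table": "bufstream.orders",
    },
)

# create_topics returns a dict of topic name -> future; result() raises on failure.
futures = admin.create_topics([topic])
futures["orders"].result()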
You can read and update the topic configuration values from Kafka GUIs like AKHQ or Redpanda Console, where they're listed as bufstream.archive.kind, bufstream.archive.iceberg.catalog, and bufstream.archive.iceberg.table.
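You can also verify the values programmatically. Here's a minimal sketch with the confluent-kafka AdminClient, again assuming the placeholder orders topic and local bootstrap address from the example above:

from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# Describe the hypothetical "orders" topic and print its archive-related configs.
resource = ConfigResource("topic", "orders")
future = admin.describe_configs([resource])[resource]
for name, entry in future.result().items():
    if name.startswith("bufstream.archive."):
        print(f"{name} = {entry.value}")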