Streaming data quality is broken. Semantic validation is the solution.

August 20, 2025

When you're working with mission-critical streaming data, any record flowing through your pipelines is a potential point of failure. When schemas drift or malformed data sneaks through, your downstream systems don't just slow down—they break. Hard. And if you're on-call, you know exactly what that 3 AM wake-up call feels like.

What if the solution to these data quality woes was in fact a cure rather than the quick fixes you've been sold over and over again by the marketing hype machines? What if this remedy was built from the ground up to work for you rather than against you? What if you could take control of your streaming data quality once and for all? You'd probably think this sounds too good to be true, but this is exactly why we built Bufstream.

We're going to break down three fundamental concepts:

  • Client-side vs. Broker-side validation
  • Schema ID vs. Schema vs. Semantic validation
  • Runtime vs. Build-time schema governance

Together, we'll prove that these concepts aren't just marketing buzzwords but rather core architectural decisions empowering reliable end-to-end streaming data quality. By the end of this post, you'll understand exactly why these concepts matter, how most streaming platforms get them wrong, and why Bufstream is the only Kafka product to provide a unified solution for all three.

Client-side vs. Broker-side validation

Do you trust all data-producing engineers (data producers) to always do the right thing? Even when they're up against a deadline? It depends, right? For most of us, we're used to following some form of trust but validate—be it via code reviews or other formal change management processes. These rituals become a means to an end—with the end being a low-risk deployment or change to a customer-facing system. Would you be okay with removing these quality gates and process-based control loops and instead allowing everyone to simply "do what they want?" Probably not.

It's for these same reasons we believe client-side validation is such a hard sell. But don't just take our word for it. We'll break down the difference between client-side and broker-side validation next and then see why broker-side validation is the path forwards.

What's the difference?

Client-side validation is when your data is validated on the producing client, instead of centrally on a server (or broker). Broker-side validation checks your data centrally on the broker as it is produced, creating a more unified experience. While this may seem like a simple role reversal, there is actually no concept of broker-side validation within the Apache Kafka ecosystem. With traditional Kafka, topics act like dumb pipes—blindly connecting data producers and consumers while offloading all data quality concerns onto the clients.

Client-side validation adds confusion and complexity

This has been the standard way of applying any kind of data validation within the Kafka ecosystem. The responsibility lands directly in the hands of the engineering team tasked with publishing data—even if they are only a proxy between an upstream system and Kafka. The catch is that there is no standard way of applying data quality checks—so each data-producing client is responsible for opting in to per-record validation before publishing any set of records, while also being on the hook to ensure its local schemas are up to date with every other publisher's. Lastly, each data-producing team follows whatever ways of working it believes work best—even if things could be much easier.

By having no agreed-upon way of properly validating the integrity of the data, we've essentially created a problem that adds complexity at each turn rather than reducing the system-wide level of effort. Why hope that each data producer will do the right thing when you can guarantee that every record is valid before it ever reaches a consumer?
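
Concretely, each producing team ends up hand-rolling something like the sketch below before every send. This is purely illustrative Go: the struct mirrors the User message we're about to define, and the rules are whatever this particular team happened to decide on.

// One team's hand-rolled, client-side guard. Purely illustrative: the struct
// stands in for code generated from the User schema defined below.
package producer

import (
	"errors"
	"net/mail"
)

type User struct {
	ID        string `json:"id"`
	Email     string `json:"email"`
	Age       uint32 `json:"age"`
	LastName  string `json:"last_name"`
	FirstName string `json:"first_name"`
}

// validateUser encodes this team's private notion of "valid".
func validateUser(u *User) error {
	if u.ID == "" {
		return errors.New("id is required")
	}
	if _, err := mail.ParseAddress(u.Email); err != nil {
		return errors.New("email must be a valid email address")
	}
	if u.Age > 150 {
		return errors.New("age must be between 0 and 150")
	}
	return nil
}

Another team publishing to the same topic may check different rules, or none at all, and nothing in Kafka itself will notice the difference.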

To illustrate the client-side data-producing problem, let's look at a simple example. Assuming the following message:

message User {
  string id = 1; // A UUID for the user. Required.
  string email = 2; // The user's email address. Required.
  uint32 age = 3; // The user's age. Must be a valid age between 0 and 150.
  string last_name = 4; // The user's last name.
  string first_name = 5; // The user's first name.
}

How would you block the following data—which has a missing id, an invalid email address, and a clearly problematic human age—from ever making it to a topic?

{"email": "123 Main St", "age": 1000, "last_name": "Acme", "first_name": "Bob"}

You don't have to answer straight away; just take a moment to think about the approach you'd take. Most likely, you're thinking about how to block invalid data while it's still in transit. One common approach is to introduce a validation API—think of it like a bouncer at a nightclub—between the data-producing services and the Kafka topics themselves. This is one way to block invalid data at the ingress edge before you've poisoned the proverbial data well: your Kafka topic. While it's a decent strategy, it adds operational complexity, since the technique typically scales out on a per-topic basis and creates a lot more surface area for engineers to manage.
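
Building on the validateUser sketch above, a hypothetical ingress "bouncer" might look roughly like this; produceToKafka is a placeholder for whatever Kafka producer client you already run.

// A hypothetical ingress "bouncer" between producers and the topic, reusing
// the validateUser sketch above.
package producer

import (
	"context"
	"encoding/json"
	"net/http"
)

// produceToKafka is a placeholder: serialize and publish with your Kafka
// client of choice.
func produceToKafka(ctx context.Context, topic string, u *User) error {
	return nil
}

func handleUserIngress(w http.ResponseWriter, r *http.Request) {
	var u User
	if err := json.NewDecoder(r.Body).Decode(&u); err != nil {
		http.Error(w, "malformed payload", http.StatusBadRequest)
		return
	}
	if err := validateUser(&u); err != nil {
		http.Error(w, err.Error(), http.StatusUnprocessableEntity)
		return
	}
	if err := produceToKafka(r.Context(), "users", &u); err != nil {
		http.Error(w, "failed to produce", http.StatusBadGateway)
		return
	}
	w.WriteHeader(http.StatusAccepted)
}

Note that this only moves the problem: the bouncer is another service to deploy, scale, and keep in sync with the schema for every topic it guards.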

The real reason per-record validation is typically delegated to the client side is that broker-side schema-aware validation is a difficult problem to solve, and one that has to be tackled at the platform level. However, centralized schema governance and consistent streaming data quality fit neatly within the broker's responsibilities.

Broker-side validation unifies data ingestion

Broker-side validation does introduce additional costs: increased CPU utilization, more resource overhead at the topic level, and greater responsibility for the brokers in your cluster. To be fair, we're now asking significantly more of the brokers.

With traditional Kafka, topics act like dumb pipes—blindly connecting data producers and consumers while offloading all data quality concerns to them—as we saw when we explored client-side validation. Now, with the introduction of schema-aware brokers in Bufstream, we've simplified the streaming data quality problem by focusing on Protobuf-driven semantic validation at the broker rather than pushing the problem onto the producers and consumers.

So how do you solve the challenge of validating arbitrary amounts of streamed data in a type-aware manner at the broker level? What about coordinating record-level semantic validation at the topic level? To answer these questions, we'll examine the various styles of schema validation within the Kafka ecosystem and introduce you to Protobuf-driven semantic validation with Bufstream.

Schema ID vs. Schema vs. Semantic validation

Apache Kafka, by design, has no clue what data should or shouldn't be sent to a topic. This led to the creation of the Confluent Schema Registry (CSR), which enables runtime schema governance within the Kafka ecosystem. The two common approaches for runtime governance are schema ID and schema validation—both of which we believe fall short of providing a complete solution for streaming data quality. For Bufstream, we've introduced a third option called semantic validation that is in a class all its own. We'll take a look at the common approaches, and then dive deeper into Protobuf-driven semantic validation.

💡 The Confluent Wire Format adds a 5-byte header to each published record. The first byte is the "magic byte", which defaults to 0, and the next 4 bytes make up the schema ID as a big-endian integer. The schema ID is required to identify a known schema and is a prerequisite for any of the following validation methods.
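
For illustration, here's a minimal Go sketch of splitting a record value into its schema ID and payload. The layout follows the wire format described above; the package and function names are our own.

// Peel the Confluent wire-format header off a record value: one magic byte,
// then a 4-byte big-endian schema ID, then the encoded payload.
package wire

import (
	"encoding/binary"
	"fmt"
)

func parseWireFormat(value []byte) (schemaID uint32, payload []byte, err error) {
	if len(value) < 5 || value[0] != 0 {
		return 0, nil, fmt.Errorf("not in Confluent wire format")
	}
	return binary.BigEndian.Uint32(value[1:5]), value[5:], nil
}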

Schema ID validation

Schema ID validation is a lightweight approach where data-producing clients, or brokers that support it (Apache Kafka does not), check that a known schema ID is included on each message. Producers provide the ID associated with the schema of their data by either including it directly via configuration or fetching it from the Schema Registry at runtime to prepend to each record.

However, schema ID validation doesn't resolve any data quality problems. Since the schema ID is a simple monotonically incrementing integer, a producer that guesses a low number is just as likely to pass the schema ID check. The validation does nothing more than confirm the presence of a recognized ID—it doesn't verify that the actual data matches the schema. In other words, producers can send any data they want, as long as they claim it's valid!

Schema validation

Schema validation enables either the data-producing client or the schema-aware broker to check more than just the legitimacy of a schema ID—it tests that the encoded record conforms to the shape of the known schema for a given topic.

Going back to the earlier example, we would be able to successfully pass simple schema validation by sending the following User:

{"email": "123 Main St", "age": 1000, "last_name": "Acme", "first_name": "Bob"}

The schema validation process would check the following:

  • ✅ - email is a string
  • ✅ - age is an integer
  • ✅ - last_name is a string
  • ✅ - first_name is a string

...and incorrectly conclude that the payload is valid. The fact that the User is missing an id, or that the email is actually a street address, isn't a concern of this check at all; and that should be concerning.

While this style of check is far superior to schema ID validation—it will parse the payload and check that the values provided are of the correct type—it's more of a glorified type-safety check. It has value, but still misses the mark when compared to rules-based semantic validation.
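
To make the shape-only behavior concrete, here's a small sketch in Go. Because this post shows records as JSON, the sketch parses the payload with protojson; real Kafka records would carry binary Protobuf, but the point is the same. The generated package path is hypothetical.

// A shape-only check: if the payload parses against the schema, it "passes",
// no matter how nonsensical the values are.
package main

import (
	"fmt"

	"google.golang.org/protobuf/encoding/protojson"

	userv1 "example.com/gen/user/v1" // stand-in for your generated code
)

func main() {
	payload := []byte(`{"email": "123 Main St", "age": 1000, "last_name": "Acme", "first_name": "Bob"}`)

	user := &userv1.User{}
	if err := protojson.Unmarshal(payload, user); err != nil {
		fmt.Println("rejected:", err)
		return
	}
	// No error: every field has the right type, so schema validation is happy,
	// even though the email is a street address and the age is 1000.
	fmt.Println("accepted:", user)
}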

Semantic validation

Semantic validation is the ultimate litmus test for high-quality streaming data. Rather than simply testing that the received payload carries a known schema ID (schema ID validation) or conforms to the correct field-level types (schema validation), it deeply inspects the semantic properties—like an email address being more than just a string—and field-level rules—like the pattern an email address must match so that a street address can't slip through—using Protovalidate annotations.

These annotations aren't loosely coupled either: they are shipped directly in your Protobuf definitions and made generally available through the Buf Schema Registry (BSR) to your Bufstream brokers. This tight coupling enables you to trust that the data flowing through your topics does in fact behave as intended, so you can rest assured that all data for a given topic follows the rules!

Protovalidate isn't the only way to enable semantic validation for your data—however, it is the gold-standard way for Protobuf and the only option when it comes to broker-side semantic validation with Bufstream.

At this point, we've covered the difference between schema ID and schema validation, and introduced semantic validation conceptually, but it's worth diving deeper to see how semantic validation is a true differentiator for our streaming data quality.

Bufstream semantic validation in action

Earlier we had you imagine a scenario where you defined a basic User, and we asked what approach you'd take to prevent invalid data from reaching your topic—don't worry if you didn't have a concrete answer. We'll now answer that initial question by using Protovalidate to semantically enforce our data quality rules.

Remember, fields within a schema have semantic properties as well:

  • They can be optional or required.
  • Fields like email addresses aren't just strings—they must match a specific format.
  • Numbers, like a user's age, may be a uint32, but you can safely assume a user can't be more than 150 years old.

The following message is the revised User from earlier with the addition of our Protovalidate annotations:

message User {
  string id = 1 [(buf.validate.field).string.uuid = true];
  string email = 2 [(buf.validate.field).string.email = true];
  uint32 age = 3 [(buf.validate.field).cel = {
    id: "user.age",
    message: "age must be between 0 and 150",
    expression: "this <= 150",
  }];
  string last_name = 4 [(buf.validate.field).string = {
    max_len: 100,
  }];
  string first_name = 5 [(buf.validate.field).string = {
    max_len: 100,
  }];
}

The validation rules applied here ensure the following is true for any valid User instance:

  • Each id conforms to a UUID. Not just a string.
  • An email address is a valid email. Not just a string.
  • The age provided is within the range of 0 and 150.
  • The first_name and last_name fields are capped at a maximum of 100 characters.

With the addition of our schema-aware brokers, Bufstream can now stop invalid data in its tracks. If we were to publish the following record to our Bufstream topic:

{"email": "123 Main St", "age": 1000, "last_name": "Acme", "first_name": "Bob"}

...the broker would test the record using the Protovalidate validation rules:

  • ❌ - email: value must be a valid email address
  • ❌ - age: age must be between 0 and 150
  • ✅ - last_name is within the 100-character limit
  • ✅ - first_name is within the 100-character limit

...and reject the record.
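
If you'd rather reproduce that rejection in code than take our word for it, a minimal sketch with the protovalidate-go library looks roughly like this. The generated package path is again hypothetical, and the exact module path and API may differ slightly between library versions.

// Validate the same bad User locally with Protovalidate; the violations
// mirror what the Bufstream broker reports.
package main

import (
	"fmt"

	protovalidate "github.com/bufbuild/protovalidate-go"

	userv1 "example.com/gen/user/v1" // stand-in for your generated code
)

func main() {
	user := &userv1.User{
		Email:     "123 Main St",
		Age:       1000,
		LastName:  "Acme",
		FirstName: "Bob",
	}

	validator, err := protovalidate.New()
	if err != nil {
		panic(err)
	}
	if err := validator.Validate(user); err != nil {
		fmt.Println("rejected:", err) // e.g. email: value must be a valid email address
		return
	}
	fmt.Println("accepted")
}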

👉 The proof is really in the pudding here, so if you'd like to try semantic validation in action please pop on over to our Protovalidate Playground to give this example a try, or better yet read our semantic validation docs to dive deeper into Bufstream.

Now that we've covered the difference between client-side and broker-side validation, and you've seen our recommendation for Protobuf-based semantic validation (with the option to try it yourself in the Protovalidate Playground), you might be asking what's left to discover. The final piece of the puzzle is how we gate changes to our precious schemas, a process that can make or break your data pipelines.

Runtime vs. Build-time schema governance

Traditionally, anything having to do with schemas in the Kafka ecosystem has been a concern of the data-producing client, and this establishes a brittle system of runtime schema governance even with the CSR in play.

For example, when a client needs to send a message that conforms to a schema, that schema can be lazily registered with the schema registry by the client itself (at runtime). The client simply asks the registry (via the API) if the intended schema is known. If it isn't, the schema is auto-registered as a new version. This means there is effectively no central governance. This is a major problem. Why, you may ask?
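
For context, this lazy registration is often nothing more than a default serializer setting. With Confluent's serializers, for example, the behavior is governed by producer configuration roughly like the following (values illustrative):

# Confluent serializer settings (illustrative values).
# auto.register.schemas defaults to true, so any producer can mint a
# brand-new schema version at runtime.
schema.registry.url=https://schema-registry.example.com
auto.register.schemas=true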

Imagine the following breaking change being introduced in the User message when the email field changes to email_address:

message User {
  string id = 1;
  string email_address = 2;
  uint32 age = 3;
  ...
}

When producers register schemas at runtime—with no version control and no guards against ad-hoc registration of rogue or breaking schemas—there is effectively no safety net against breaking changes in production. Worse, you're only alerted to the data quality problem after the point of data ingestion, which means the offending release has already gone to production. The easiest way to ensure this never happens is to introduce build-time schema governance that blocks breaking changes within your CI/CD workflows.
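
In practice, that gate can be a single CI step. With the Buf CLI, for example, checking a proposed change against your main branch looks like this:

# Fail the build if the local Protobuf schemas break compatibility with main.
buf breaking --against '.git#branch=main'

Because the check runs before merge, a breaking change like renaming email to email_address never reaches the BSR, let alone a broker.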

With Bufstream, schemas can't be registered on the fly from clients: changes must first go through source control and any other necessary reviews before they are made available via the Buf Schema Registry (BSR). And because the BSR provides a CSR-compatible API, we can deliver true topic-level (or record-level) schema governance holistically—and that should matter to you.

While zero central governance may have worked for your schemas in the past, it is only a matter of time before breaking changes take down any number of mission-critical data pipelines. It is from experience that we've made centralized data governance a core tenet of Bufstream's architecture. We believe it's only through build-time schema governance and true schema-aware broker-side semantic validation that the problems of streaming data quality truly become a thing of the past.

Are you ready to embrace a future where your data works for you vs. against you?

At Buf, we're betting on a future where your streaming infrastructure is as smart about data quality as your API layer. Where schemas aren't just documentation, but enforced contracts. Where bad data gets caught at the broker, not in your quarterly reports.

Want to see what real semantic validation looks like? Try Bufstream, run that experiment from the beginning of this post, and get in touch. Your broker will finally know what an email address is. And your data engineers will thank you.
