Protobuf is the most stable and widely adopted interface description language available today, which is why Buf is concentrating its initial efforts on Protobuf. However, Protobuf has never had an officially published grammar: proto2 and proto3 specs are published, but neither covers all edge cases, of which there are many (especially around options). In effect, the official Protobuf "grammar" is the protoc implementation. It has been the only codified representation of what Protobuf is, and the only way to properly parse Protobuf messages and produce FileDescriptorSets suitable for stub generation.
Additionally, many situations outside of stub generation rely on proper Protobuf parsing, such as documentation generation, linters, and breaking change detectors. All existing Protobuf tooling has gone one of two routes:
- Use a third-party Protobuf parser, instead of protoc, that produces non-FileDescriptorSet results. Many third-party Protobuf parsers exist, but none has been able to reliably cover all edge cases of the grammar. Inevitably there are breakdowns that result in either parse errors or an invalid representation of Protobuf sources. The edge cases in the Protobuf grammar are so numerous that some of the most popular third-party parsers get around the problem by happily parsing invalid Protobuf, so these parsers cannot be used to decide whether or not a file is valid.
- Shell out to (or build against) protoc. This results in both accurate parsing and FileDescriptorSet production, but it presents a number of issues. First, managing external protoc installs becomes problematic: any such tooling must either manage protoc installs itself or rely on protoc being deterministically installed. Second, parsing protoc's output is difficult: there is no structured output format, both warnings and errors are printed to stderr, and the warning and error output changes between minor releases. To accurately parse protoc's output, tooling needs to handle every release of protoc as it comes out, which makes any such tooling unmaintainable. Additionally, protoc has different behavior depending on the location of the Well-Known Types.
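To illustrate why scraping protoc's stderr is fragile, here is a minimal Go sketch. It assumes diagnostics look like path:line:column: message, with warnings marked by a "warning: " prefix. That shape is an assumption made for illustration, not a documented contract, and the names diagnostic and parseDiagnostic are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// diagnostic is a hypothetical structured form of one stderr line.
type diagnostic struct {
	File    string
	Line    int
	Column  int
	Warning bool
	Message string
}

// ASSUMPTION: lines look like "path:line:column: message".
// Nothing guarantees this shape across compiler releases.
var diagnosticRE = regexp.MustCompile(`^(.+?):(\d+):(\d+): (.*)$`)

// parseDiagnostic returns false for any line that does not match the
// assumed shape, which is exactly the failure mode described above.
func parseDiagnostic(line string) (diagnostic, bool) {
	m := diagnosticRE.FindStringSubmatch(line)
	if m == nil {
		return diagnostic{}, false
	}
	lineNo, err := strconv.Atoi(m[2])
	if err != nil {
		return diagnostic{}, false
	}
	column, err := strconv.Atoi(m[3])
	if err != nil {
		return diagnostic{}, false
	}
	message := m[4]
	// ASSUMPTION: warnings are distinguished only by a message prefix.
	warning := strings.HasPrefix(message, "warning: ")
	message = strings.TrimPrefix(message, "warning: ")
	return diagnostic{
		File:    m[1],
		Line:    lineNo,
		Column:  column,
		Warning: warning,
		Message: message,
	}, true
}

func main() {
	// Made-up sample lines, not captured compiler output.
	samples := []string{
		`a.proto:3:5: Expected ";".`,
		"a.proto:1:1: warning: Import b.proto is unused.",
		"something went wrong",
	}
	for _, s := range samples {
		d, ok := parseDiagnostic(s)
		fmt.Printf("ok=%v %+v\n", ok, d)
	}
}
```

Any line that drifts from the assumed shape is silently dropped or misclassified, so a tool built this way has to be re-checked against each compiler release.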
We find neither of these solutions tenable in the long term for a tool that aims to manage your Protobuf schema, and eventually host Images suitable for stub generation. Therefore, we've taken a different route:
- Buf's primitive is the Image, an extension of the FileDescriptorSet. This means that Buf speaks the same internal language as protoc and existing Protobuf plugins.
- Since Buf speaks in terms of Images, Buf also speaks in terms of FileDescriptorSets. This enables us to take protoc output as buf input - instead of shelling out to protoc, buf allows you to manage your own protoc installation and invocation, and merely take results from protoc.
- As only using protoc would result in an unwieldy, non-self-contained tool for static analysis, and protoc does not provide the extensions and additional verification we want to do as part of our build for the future Buf Schema Registry, Buf uses a newly developed Go-based Protobuf compiler that is tested to cover every known edge case that protoc itself covers, and is continuously tested against thousands of widely used .proto files for equivalence with protoc.
The internal compiler quite literally replaces protoc outside of the builtin plugins (--cpp_out, etc.) - we know that's a big statement, and one we would not trust ourselves.
The resulting FileDescriptorSets are tested for equivalence to protoc's across both proto2 and proto3 definitions, imports, FileDescriptorProto ordering, SourceCodeInfo, and custom options. The resulting FileDescriptorSets are in fact almost byte-equivalent to protoc's - under most scenarios without SourceCodeInfo, you can compare the byte representation of a serialized FileDescriptorSet produced by buf and by protoc, and they will be equal. There are two known exceptions that make this not always the case:
- Buf actually produces additional intermediate SourceCodeInfos, and retains more detached comments, than protoc. This is strictly more information for consumers of the resulting FileDescriptorSets.
- Buf represents custom/unknown options slightly differently on the wire, although when deserialized, the result is equivalent for consumers of FileDescriptorSets. There is an effort to work around this so that FileDescriptorSets can be compared for testing, but it is not high priority, as it has zero effect on any actual usage.
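The byte-level comparison described above can be sketched in Go as follows. This is a minimal illustration, not Buf's own test harness: firstDiff is a hypothetical helper, and the two file paths are whatever serialized FileDescriptorSets you produced yourself with buf and with protoc.

```go
package main

import (
	"fmt"
	"os"
)

// firstDiff returns the index of the first byte at which a and b differ,
// or -1 if the two slices are byte-for-byte equal. If one slice is a
// strict prefix of the other, the length of the shorter slice is returned.
func firstDiff(a, b []byte) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if a[i] != b[i] {
			return i
		}
	}
	if len(a) != len(b) {
		return n
	}
	return -1
}

func main() {
	if len(os.Args) != 3 {
		fmt.Println("usage: fdsdiff <first.bin> <second.bin>")
		return
	}
	a, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	b, err := os.ReadFile(os.Args[2])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if i := firstDiff(a, b); i >= 0 {
		fmt.Printf("files differ at byte offset %d\n", i)
		os.Exit(1)
	}
	fmt.Println("files are byte-equivalent")
}
```

Given the two exceptions listed above, a mismatch from a check like this does not by itself mean the descriptors differ once deserialized.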
Besides removing the need to manually manage protoc and the Well-Known Types (which buf handles in all cases), Buf's compiler actually has additional advantages:
- Buf does additional build verification to make sure your proto_paths and the files within them do not lead to undetected bugs.
- Buf is actually considerably faster than protoc in most scenarios - Buf parses your .proto files across all available cores, and re-orders the result to match protoc's ordering as a post-processing task. As an example, Buf can compile all 2,311 .proto files in googleapis in about 0.8s on a four-core machine, as opposed to about 4.3s for protoc on the same machine.
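The "parse in parallel, then re-order" idea can be sketched in Go as below. This is not Buf's actual compiler code: parsedFile, parseFile, and parseAll are hypothetical names, the parser is a stub that only extracts the package name, file paths are assumed to be unique, and the order of the input slice stands in for protoc's ordering.

```go
package main

import (
	"fmt"
	"runtime"
	"sort"
	"strings"
	"sync"
)

// parsedFile is a hypothetical stand-in for a parsed .proto file.
type parsedFile struct {
	Path    string
	Package string
}

// parseFile is a stub parser: it only extracts the package name.
func parseFile(path, src string) parsedFile {
	pf := parsedFile{Path: path}
	for _, line := range strings.Split(src, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "package ") {
			pf.Package = strings.TrimSuffix(strings.TrimPrefix(line, "package "), ";")
			break
		}
	}
	return pf
}

// parseAll parses files on one worker per CPU, then sorts the results
// back into the order given by paths as a post-processing step.
func parseAll(paths []string, sources map[string]string) []parsedFile {
	jobs := make(chan string)
	out := make(chan parsedFile)
	var wg sync.WaitGroup
	for w := 0; w < runtime.NumCPU(); w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for path := range jobs {
				out <- parseFile(path, sources[path])
			}
		}()
	}
	go func() {
		for _, p := range paths {
			jobs <- p
		}
		close(jobs)
	}()
	go func() {
		wg.Wait()
		close(out)
	}()

	// Results arrive in completion order, which is not deterministic.
	results := make([]parsedFile, 0, len(paths))
	for pf := range out {
		results = append(results, pf)
	}

	// Post-processing: restore the deterministic target order.
	rank := make(map[string]int, len(paths))
	for i, p := range paths {
		rank[p] = i
	}
	sort.Slice(results, func(a, b int) bool {
		return rank[results[a].Path] < rank[results[b].Path]
	})
	return results
}

func main() {
	paths := []string{"c.proto", "a.proto", "b.proto"}
	sources := map[string]string{
		"a.proto": "package a;",
		"b.proto": "package b;",
		"c.proto": "syntax = \"proto3\";\npackage c;",
	}
	for _, pf := range parseAll(paths, sources) {
		fmt.Println(pf.Path, pf.Package)
	}
}
```

Separating the parallel parse from the final sort keeps the output deterministic no matter which worker finishes first.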
We know this is all a series of big claims, and we would not trust it ourselves - there have been many claims in the Protobuf community about producing non-protoc-based parsing. This is one of the reasons that we enable protoc output to be buf input - if you don't trust us, then use protoc as your compiler instead, no problem.
It's also one of the reasons we've exposed
buf image build as we have - you can
produce FileDescriptorSets yourself and pass them to your Protobuf plugins to
verify that the resulting stubs are equivalent. There is one known exception with
docs generated based on json_name, see this issue
to track this being updated within
Given the following call:
```
# Adjust -I as necessary, for example with googleapis, this should be "-I ."
$ rm -rf java
$ mkdir java
$ protoc -I root --java_out=java $(find root -name '*.proto')
```
You can instead use Buf's compiler to generate your stubs, passing the built Image to protoc via --descriptor_set_in:
```
# For parity with the above example, we're assuming we have our build.roots
# configured in a buf.yaml file in the current directory.
#
# We need to do "buf image build | buf ls-files --input -" instead of "buf ls-files"
# to make sure that the filenames are root-relative.
$ rm -rf java
$ mkdir java
$ buf image build -o - | protoc --descriptor_set_in=/dev/stdin --java_out=java $(buf image build -o - | buf ls-files --input -)
```
This results in protoc's internal parser not being used at all, so you can verify our claims further. If you do find an issue, please contact us.
Having this new compiler is a key component of Buf's future - right now, it enables reliable linting and breaking change detection, and in the future, it will enable a lot of real-time possibilities for us.