Descriptors

Descriptors describe Protobuf definitions. They are themselves modeled as Protobuf messages, defined in google/protobuf/descriptor.proto (included with Protobuf compilers). They're a fundamental part of the Protobuf ecosystem and the foundation of Protobuf’s plugin system, as well as all reflection-based tasks. If you want to understand how Protobuf works and use Protobuf in more advanced ways, you need to understand how descriptors work.

What are descriptors?

The term “descriptors” refers to models that describe the data types defined in Protobuf sources. They resemble an AST (Abstract Syntax Tree) of the Protobuf IDL, using Protobuf messages for the tree nodes.

Descriptors are used for three key purposes:

  • Code generation: When the compiler invokes a plugin to generate code, it uses descriptors to provide the plugin with a description of what to generate.
  • Reflection: The ability to examine the Protobuf schema details at runtime. This allows code to examine metadata that is defined in the Protobuf sources, such as inspecting field numbers, or custom options defined on elements like messages, fields, and methods.
  • Dynamic messages and dynamic RPC: The ability to interact with message types and Protobuf RPC services without generating code. Instead of statically generating RPC interfaces and data structures to represent request and response messages, the descriptors can be used at runtime to build dynamic data types and RPC stubs.

Descriptors can be a source of confusion for developers who are learning Protobuf. There are a few reasons:

  • Historically, descriptors haven't been well documented and described. Often, they are deeply understood by only a few Protobuf “experts” at an organization, where that knowledge may be gained by a few through extensive tinkering or handed down from one person to the next via pair programming.

  • Descriptors are self-referential: not only do they describe Protobuf sources, they are actually defined using Protobuf sources as well. These descriptor Protobuf messages are produced by a compiler and used to perform code generation. Their contents are also embedded into the generated code in some runtimes. This means you need some basic knowledge of Protobuf—about the IDL syntax, its concepts, and its type system—before trying to learn about descriptors.

  • Some aspects of descriptors and how IDL source code maps to a descriptor representation are non-intuitive. Numerous quirks exist, but one of the biggest hurdles is that descriptors are file-centric. Instead of simply describing types, descriptors describe files which contain the types. This can complicate the use of descriptors, which can further encumber new developers trying to learn about them.

This article aims to demystify descriptors. The next sections will dive more deeply into what descriptors are and how they can be effectively used.

Deep dive into the model

As mentioned above, the well-known file google/protobuf/descriptor.proto defines Protobuf messages for the various kinds of descriptors.

It uses “proto2” syntax because it declares extension ranges, which aren't allowed in “proto3” syntax. We can use the extension ranges defined in this file to create custom options (more on that later).

The descriptor.proto file defines a message for each type of element in the language. Throughout the rest of this article, the google.protobuf package prefix will be omitted, so we can refer to relevant messages using shorthand.

  • FileDescriptorProto: The root of the hierarchy. This describes a single source file. It contains references to all top-level types defined in the file: messages, enums, extensions, and services.

  • DescriptorProto: This describes a message. It is confusingly named since, unlike all the other messages, it has no prefix that indicates the element type it describes (i.e. it lacks a “Message” prefix in the name). This is a historical relic—you can think of this as MessageDescriptorProto in effect.

    This element may also contain references to other nested types: messages, enums, and extensions that are defined inside another message.

  • FieldDescriptorProto: This describes a field, defined in a message, or an extension.

  • OneofDescriptorProto: This describes a oneof defined in a message. Note that this structure doesn't contain any references to fields. Instead, a FieldDescriptorProto refers to its enclosing oneof via an index, and the OneofDescriptorProto itself is just a placeholder.

  • EnumDescriptorProto: This describes an enum. It contains enum values.

  • EnumValueDescriptorProto: This describes an enum value, also called an enum entry.

  • ServiceDescriptorProto: This describes a service. It contains methods.

  • MethodDescriptorProto: This describes a method, also called an “RPC”.

So the full hierarchy, in a tree diagram, looks like this:

─ FileDescriptorProto
   ├─ DescriptorProto           // Messages
   │   ├─ FieldDescriptorProto  //   - normal fields and nested extensions
   │   ├─ OneofDescriptorProto
   │   ├─ DescriptorProto       //   - nested messages
   │   │   └─ (...more...)
   │   └─ EnumDescriptorProto   //   - nested enums
   │       └─ EnumValueDescriptorProto
   ├─ EnumDescriptorProto       // Enums
   │   └─ EnumValueDescriptorProto
   ├─ FieldDescriptorProto      // Extensions
   └─ ServiceDescriptorProto    // Services
       └─ MethodDescriptorProto

You can find more information on how these Protobuf messages are populated by a compiler in the Protobuf Guide.

Options messages

The descriptor.proto file also defines the options messages. There is one such message type for each element in the language for which options may be defined.

  • FileOptions: This message represents the metadata defined by top-level “option” declarations, such as for indicating the Go package or the Java package of generated code.

  • MessageOptions: This represents the metadata defined by “option” declarations inside a message definition. (Note: It contains a field named map_entry that may not actually be used by an “option” declaration.)

  • FieldOptions: This is for options that may appear at the end of a field definition.

    There are two options that can appear here that are not actually represented in the FieldOptions message: default and json_name. While these look like any other option in the source file syntax, they are handled differently by the compiler and stored elsewhere in the corresponding FieldDescriptorProto message.

  • OneofOptions: This is for “option” declarations inside a oneof definition.

  • ExtensionRangeOptions: For options that may appear at the end of an extension range declaration (only allowed in “proto2” syntax).

  • EnumOptions: For “option” declarations inside an enum definition, such as indicating if the enum allows aliases (multiple names for the same numeric value).

  • EnumValueOptions: For options that may appear at the end of an enum value definition.

  • ServiceOptions: For “option” declarations inside a service definition.

  • MethodOptions: For “option” declarations inside a method definition, such as the method’s idempotency level.

An important quality shared by all of the above options messages is that they are all extendable. They all have extension ranges starting at field number 1000. This is what enables “custom options” (often called annotations), which are actually extension fields of an options message.

The following example shows a simple source file that uses options:

syntax = "proto3";
package foo.bar;

// These next two lines are file options
option go_package = "github.com/foo/bar";
option java_package = "com.foo.bar";

message Foo {
  // This is a message option
  option deprecated = true;

  string name = 1;

  // This field has an option (packed)
  repeated int64 class_ids = 2 [packed = true];

  // As does this one (debug_redact)
  string ssn = 3 [debug_redact = true];
}

And here’s the corresponding FileDescriptorProto (shown using the JSON format; some inconsequential fields omitted for brevity):

{
  "syntax": "proto3",
  "package": "foo.bar",
  "options": {
    "go_package": "github.com/foo/bar",
    "java_package": "com.foo.bar"
  },
  "message": [
    {
      "name": "Foo",
      "options": { "deprecated": true },
      "field": [
        { "name": "name", "number": 1, "type": "TYPE_STRING", "label": "LABEL_OPTIONAL" },
        {
          "name": "class_ids",
          "number": 2,
          "type": "TYPE_INT64",
          "label": "LABEL_REPEATED",
          "options": { "packed": true }
        },
        {
          "name": "ssn",
          "number": 3,
          "type": "TYPE_STRING",
          "label": "LABEL_OPTIONAL",
          "options": { "debug_redact": true }
        }
      ]
    }
  ]
}

In the above example, we can clearly see how “option” declarations in the source file correspond to the options fields of the descriptor messages.

  • On lines 5 and 6 of the source, we see file options that map to fields of the same name in the FileOptions message (go_package and java_package, respectively).

  • On line 10, we see a message option that maps to a field in the MessageOptions message (deprecated).

  • And on lines 15 and 18, we see field options that map to fields of the same name in the FieldOptions message (packed and debug_redact, respectively).

Custom options

A custom option is an extension field on one of the options messages. Here’s an example:

syntax = "proto3";
package foo.bar;
import "google/protobuf/descriptor.proto";

// By extending FileOptions, we are creating custom file options.
extend google.protobuf.FileOptions {
  // Each field defined in this block is an extension.

  // This extension's full name is "foo.bar.baz" since it's in a
  // file whose package is "foo.bar".
  string baz = 30303;
  // NOTE: We can't create another extension in this file named "baz"
  // that extends a different message because it would also be named
  // "foo.bar.baz". The extended message (in this case FileOptions)
  // is *not* part of the extension's name.
}

// Now we can use the above custom option.
// Note the use of parentheses around the name, which indicate
// that it's an extension name.
option (foo.bar.baz) = "abc";

The above demonstrates a custom file option. The same can be done for other kinds of options by simply extending one of the other options messages. For example, extending MessageOptions creates custom message options.

Source code information

A FileDescriptorProto may optionally contain source code information. It is modeled via the Protobuf message SourceCodeInfo and indicates position information (i.e. line and column) for elements defined in the file, such as the location in the file where a particular message or field is declared. It also includes comment information, such as documentation comments for said message or field.

However, the way it is modeled is neither intuitive nor simple.

One might imagine storing the location spans and comments inline in each descriptor Protobuf message. But that would make each message larger even when no source code info is present (at least in most languages, where the fields are laid out into a struct in memory and take up space even if they are all null or empty). It also makes stripping source code info more complicated and error-prone: every element in the hierarchy must be visited to clear out these fields.

To avoid these issues, source code info is stored in a separate, “look-aside” structure. It is a separate field on the FileDescriptorProto, so stripping this information is as trivial as clearing that one field. And, since it’s not inlined, structs that represent descriptor Protobuf messages in generated code aren’t bloated with extra fields.

This look-aside structure is keyed by an element’s “path”. The element’s path represents a traversal from the root of the FileDescriptorProto message to where that element is defined.

Let’s look at an example:

syntax = "proto3";

package foo.bar;

enum Foo {
  FOO_UNSPECIFIED = 0;
  FOO_BAR = 1;
  FOO_BAZ = 2;
}

message Fizz {
  string name = 1;
}

message Buzz {
  uint64 id = 1;
  repeated string tags = 2;
  Foo foo = 3;
}

The above source turns into a FileDescriptorProto message that looks more or less like the following:

{
  "syntax": "proto3",
  "package": "foo.bar",
  "enum": [
    {
      "name": "Foo",
      "value": [
        { "name": "FOO_UNSPECIFIED", "number": 0 },
        { "name": "FOO_BAR", "number": 1 },
        { "name": "FOO_BAZ", "number": 2 }
      ]
    }
  ],
  "message": [
    {
      "name": "Fizz",
      "field": [{ "name": "name", "number": 1, "type": "TYPE_STRING", "label": "LABEL_OPTIONAL" }]
    },
    {
      "name": "Buzz",
      "field": [
        { "name": "id", "number": 1, "type": "TYPE_UINT64", "label": "LABEL_OPTIONAL" },
        { "name": "tags", "number": 2, "type": "TYPE_STRING", "label": "LABEL_REPEATED" },
        { "name": "foo", "number": 3, "type": "TYPE_ENUM", "label": "LABEL_OPTIONAL", "type_name": ".foo.bar.Foo" }
      ]
    }
  ]
}

If we want to look up the source code info for the field named tags of message Buzz, we have to traverse the above like so:

  1. First we descend into the file's list of messages. In descriptor.proto this is the message_type field, field number 4 of FileDescriptorProto, whose type is repeated DescriptorProto messages.
  2. We are now in an array. Buzz is the message at index 1. Indices are zero-based, so the message at index zero is Fizz and index one is the second entry.
  3. Now we descend into the field named field. This corresponds to field number 2 of DescriptorProto, whose type is repeated FieldDescriptorProto messages.
  4. We are in another array. The field tags is again the second element, so we go to index 1.

At this point in the traversal, we have arrived at the definition of the field tags. The traversal path is [4, 1, 2, 1], corresponding to the field numbers and array indices through which we traversed.

For another example, let’s say we want the position of the name of the enum value FOO_BAZ—not just the entire declaration but specifically the name. The traversal follows:

  1. The file's top-level list of enums (the enum_type field in descriptor.proto), which is field number 5.
  2. Into index 0 of the array (first and only item in this array).
  3. Field value, which is field number 2.
  4. Into index 2 of the array.
  5. Field name, which is field number 1.

So the path to this element is [5, 0, 2, 2, 1].

Given a traversal path, we can then examine the file’s source_code_info field, if it is present. Therein is a list of locations, each of which looks like so:

message Location {
  // The traversal path. The rest of the fields contain information
  // about the element at this path.
  repeated int32 path = 1 [packed = true];

  // This field indicates the position information for the element.
  // Always has exactly three or four elements: start line, start column,
  // end line (optional, otherwise assumed same as start line), end column.
  // These are packed into a single field for efficiency.  Note that line
  // and column numbers are zero-based -- typically you will want to add
  // 1 to each before displaying to a user.
  repeated int32 span = 2 [packed = true];

  // The fields below contain comments for the element. Comments are present
  // only for full declarations. For example, the path to a field will have
  // comments for that field. But traversal paths to components in the field
  // (like its name, number, or type) won't have comments.

  // The comments right before the element. This is typically a documentation
  // comment for the element.
  optional string leading_comments = 3;
  // The comments right after the element, if present and not associated with
  // the subsequent element.
  optional string trailing_comments = 4;
  // Any detached comments between the previous element and this one. A
  // detached comment may be separated from an element via a blank line or
  // may be otherwise ambiguous and not clearly attached to this element or
  // the previous one. If an element has all three of these fields they are
  // in the following order:
  //    // leading detached comments
  //
  //    // leading comments
  //    element
  //    // trailing comments
  repeated string leading_detached_comments = 6;
}

So we can iterate through the locations to find one that has a matching path. You can read more about how a compiler populates source code info in the Protobuf Guide.

Descriptor information that is embedded in generated code will not include source code info. This is to reduce the size of the resulting packages and/or compiled binaries. Descriptors provided to code generation plugins, on the other hand, should always include source code info. That way a code generator can propagate comments in the Protobuf source to comments in generated code.

Generating and exchanging descriptors

The final message of note in google/protobuf/descriptor.proto is FileDescriptorSet: this is a collection of files, typically in topological order. Topological order means that a file in the set will always appear after all files that it depends on. So a program that is processing the files can simply iterate over the contents and know that any dependencies of a file will have already been processed.

Compilers can produce a file containing a serialized FileDescriptorSet.

The -o option tells buf to create a file with the given name. Its contents are a serialized FileDescriptorSet, encoded using the Protobuf binary format. You may optionally specify the --exclude-source-info flag to strip source code info from the resulting descriptors. This can shrink the resulting file, if source code info isn't needed (depends on how the descriptors will be used).

$ buf build ./proto \
    -o descriptors.binpb

The -o option works the same way with protoc. The --include_imports flag is important: without it, the resulting file may be incomplete and not loadable by an application. The --include_source_info flag is optional: without it, the resulting descriptors won't contain source code info (which may or may not be useful, depending on how the descriptors will be used).

$ protoc -I ./proto \
    foo/bar/test.proto \
    -o descriptors.binpb \
    --include_imports \
    --include_source_info

Producing files like this is a common way to exchange descriptors, especially in environments where things like server reflection are unavailable. Server reflection is another way to exchange descriptors; it allows clients to download the descriptors that are embedded in the generated code of the server via an RPC.

To see an example of code that uses this serialized form and does something useful with it, see the dynamic message example below.

Quirks of the model

There are a few areas where the way things are represented in these Protobuf messages doesn’t quite match the way they look in the original source or the way we might want them to look for maximum usability.

Here are some of the biggest “impedance mismatches” to keep in mind:

  • The name fields of the various descriptor messages only contain the “simple” name: the single identifier as it appears in source. So users who need to know the fully-qualified name of an element must compute that name. This is done by combining the simple name with the enclosing file’s package (and optionally the names of any enclosing messages, if the element is nested inside a message). Here are examples of fully-qualified names:

    syntax = "proto3";             // Fully-qualified names
    package foo.bar;               // ---------------------
    
    message Baz {                  // foo.bar.Baz
      string name = 1;             // foo.bar.Baz.name
      fixed64 uid = 2;             // foo.bar.Baz.uid
      Settings settings = 3;       // foo.bar.Baz.settings
      message Settings {           // foo.bar.Baz.Settings
        bool frozen = 1;           // foo.bar.Baz.Settings.frozen
        uint32 version = 2;        // foo.bar.Baz.Settings.version
        repeated string attrs = 3; // foo.bar.Baz.Settings.attrs
      }
    }
    
  • When one descriptor references another — such as the type_name of a FieldDescriptorProto or the input_type or output_type of a MethodDescriptorProto — it is done so via a fully-qualified name plus a leading dot. So if a field is an enum of type foo.bar.Enum, then the string in the type_name field will be ".foo.bar.Enum".

  • In a FieldDescriptorProto, there is no value for the type field that indicates that a field is a map. Instead, a map field has a type of TYPE_MESSAGE, and the type_name field points to a synthetic message. (A synthetic message is one that exists in a FileDescriptorProto but has no corresponding message declaration in the source file; it is synthesized by the compiler.)

    This synthetic message has a boolean option named map_entry set to true, which indicates that it’s synthetic. The message has two fields: one named key with field number 1, and another named value with field number 2. The types of these fields match the types of the map key and value in the type declaration. So a type of map<string, uint64> results in a key field with a type of TYPE_STRING and a value field with a type of TYPE_UINT64.

  • In a file that uses “proto3” syntax, use of the optional keyword for fields (only normal fields, not extensions) results in a synthetic oneof. The optional field is the only field in that oneof.

    To distinguish this field from a normal field that happens to be defined in source in a oneof by itself, the field has a boolean setting named proto3_optional set to true. So you know a oneof is synthetic if it contains exactly one field and that field is marked as proto3_optional.

  • The label field of a FieldDescriptorProto is always present, even if there was no such label in the source. So in “proto3” syntax files, fields that don't indicate a label will still have the label field set to LABEL_OPTIONAL. When the optional label keyword is present, a separate proto3_optional field on the FieldDescriptorProto is also set to true. (See above bullet.)

  • The json_name field of a FieldDescriptorProto is always present, even if there was no such option on the field in source. This is intended to assist code generation plugins, so they know the correct JSON name for the field even when it was not explicitly set. The downside, however, is that it's not always possible to determine whether a field indicated a custom JSON name explicitly (at least not without access to the original source).

  • The syntax field of a FileDescriptorProto is left unset by compilers such as protoc for files that use “proto2” syntax, even if the source file included an explicit syntax statement. Instead of the field being present and set to the string "proto2", it is absent. So it's not possible to determine whether a file omitted the syntax statement without access to the original source.

Runtime library support

Many of the quirks described above are solved by Protobuf runtime libraries that provide good support for descriptors. Where such support exists, it is provided by wrapper types, which wrap the Protobuf messages and provide an improved interface.

These wrapper types are named almost identically to the underlying types, except that they drop the Proto suffix. For example:

  • FileDescriptor is a wrapper around a FileDescriptorProto.
  • Descriptor (or MessageDescriptor in some languages, such as Go and C#) is a wrapper around a DescriptorProto.
  • The same pattern applies for FieldDescriptor, OneofDescriptor, EnumDescriptor, EnumValueDescriptor, ServiceDescriptor, and MethodDescriptor.

These wrapper types are the most common way that applications make use of descriptors. They improve on the Protobuf messages by providing the following:

  1. They provide resolved descriptor instances instead of strings when examining references.

    For example, the type_name field of a FieldDescriptorProto is just a string. If we want to know the actual definition of the referenced type, we have to look at other elements in the enclosing file and possibly all of its imports. The string is a fully-qualified name, but names in descriptor Protobuf messages aren't qualified, so we have to compute fully-qualified names for each element as we search until finding a match.

    But with a FieldDescriptor (the wrapper type, no “Proto” suffix), we can access the referenced type and get back a proper descriptor (another wrapper type) — either an EnumDescriptor or a MessageDescriptor, for example.

  2. They provide access up the hierarchy.

    With an EnumDescriptorProto for example, one can easily examine down the hierarchy, accessing its children such as EnumValueDescriptorProto messages. But there is no simple way to access its enclosing FileDescriptorProto.

    An EnumDescriptor on the other hand (wrapper type, no “Proto” suffix) makes traversing upwards in the hierarchy easy. Runtime libraries provide a way to access an element’s immediate parent as well as a way to access the element’s enclosing file.

  3. They provide the element’s fully-qualified name. As mentioned already, the name field in a raw Protobuf message for a descriptor is a simple, unqualified name. The runtimes’ wrapper types take care of the work of computing the fully-qualified names.

  4. Some of the runtimes also provide capabilities for easily querying for source code info for a descriptor, so you don’t have to worry about computing the path for an element yourself.

Perhaps counter-intuitively, the process for instantiating these wrapper types is file-centric. In order to create a descriptor for a message or RPC service, you have to first create the descriptor for the file that contains the message or service (from a FileDescriptorProto), and then you can query the file to get the relevant MessageDescriptor or ServiceDescriptor. Also, you must have FileDescriptorProto instances for the entire transitive closure of the file — so you need a descriptor for the file itself, all of its imports, all of their imports, and so on. When creating a FileDescriptor this way, it is imperative that the source file’s import statements match the paths used when the imported files themselves were compiled. (See Protobuf files and packages for more details.)

Six of the official runtimes (those implemented and supported by Google) support descriptors. The sections below provide links to relevant API documentation as well as briefly describing the process for creating a FileDescriptor wrapper type.

C++

To create a FileDescriptor from a FileDescriptorProto, use DescriptorPool::BuildFile. All of the file’s dependencies must have already been built using the same pool.

Java

To create a FileDescriptor from a FileDescriptorProto, use FileDescriptor.buildFrom. All of the file’s dependencies must have already been built and are passed to this function along with the FileDescriptorProto.

Go

To create a protoreflect.FileDescriptor from a descriptorpb.FileDescriptorProto, use protodesc.NewFile and provide a protoregistry.Files as the resolver. This requires that all of the file’s dependencies have already been built and registered in the resolver.

Python

To create a FileDescriptor from a FileDescriptorProto, use DescriptorPool.Add. All of the file’s dependencies must have already been added to the same pool. Once added, you can retrieve the resulting FileDescriptor using DescriptorPool.FindFileByName.

C#

C# has partial support.

In C#, the runtime library doesn't provide the ability to create new descriptors at runtime or to create dynamic messages. There is a function (FileDescriptor.FromGeneratedCode) which allows for creating a FileDescriptor from a binary-encoded FileDescriptorProto, but it also requires other metadata about the corresponding generated types and isn't intended for use outside of the generated code. There is an open issue in GitHub about this gap.

PHP

PHP has partial support.

In PHP, there are internal classes that represent the descriptor wrapper types and the underlying descriptor Protobuf messages. However, the only related public API, DescriptorPool, is for interacting with descriptors embedded in generated code. So PHP doesn't provide the ability to create new descriptors at runtime or to create dynamic messages.

Use cases for descriptors

The following sections describe some use cases that require the use of descriptors and include links to relevant runtime library APIs and example code.

Reflection

The most common reason to use descriptors is for reflection. Reflection allows runtime introspection of the Protobuf schema for generated types, such as querying for field options or custom message options. Reflection also allows code to modify a message value in a generic way, without needing to know the concrete message type at compile time.

Each language runtime provides a way to access descriptors for generated types.

C++

Generated message classes in C++ are sub-classes of Message and thus have a method named GetDescriptor. They also have a method named GetReflection, which allows for reflective access to the message’s fields (both for reading and writing field values).

Note that if a file uses an option like option optimize_for = LITE_RUNTIME; then the generated messages are instead sub-classes of MessageLite and thus do not include support for descriptors or reflection.

Java

Generated message classes in Java implement the Message interface, which provides descriptor access via the getDescriptorForType method. Reflection is done via methods like getField and getRepeatedField. Messages are immutable in Java. To modify a message via reflection, one must first call the toBuilder method to convert it to a Message.Builder and then use methods like clearField, setField, and setRepeatedField.

Note that if a file uses an option like option optimize_for = LITE_RUNTIME; then the generated messages instead implement MessageLite and thus do not include support for descriptors or Protobuf reflection.

Go

Generated message structs in Go implement the proto.Message interface, which includes a ProtoReflect() protoreflect.Message method. A descriptor can be accessed by calling Descriptor() on the returned value. The other methods on the returned value provide reflection capability, with methods for both querying and mutating the message data.

Python

Generated message classes in Python are sub-classes of Message. This provides methods for querying and mutating the message via reflection and also includes a property named DESCRIPTOR, for accessing the message’s descriptor.

C#

Generated message classes in C# implement the IMessage interface, which provides descriptor access via the Descriptor property. Reflection is achieved by using the Accessor property on FieldDescriptor instances, which provides the ability to reflectively query and modify field values.

PHP

In PHP code, descriptors can be queried by first calling DescriptorPool::getGeneratedPool() and then using the getDescriptorByClassName method of the returned pool. This provides access to descriptors, but reflection isn't supported in PHP.

Example

The example code below demonstrates using reflection in Go. It’s a simple redaction function that removes field values that may contain sensitive data. The function accepts any kind of message (all generated structs that correspond to Protobuf messages implement the proto.Message interface), uses its message descriptor to inspect all fields, and, for each field where the debug_redact option is set, it clears the field value in the message.

package redact

import (
	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/reflect/protoreflect"
	"google.golang.org/protobuf/types/descriptorpb"
)

// Redact removes field values from the given message for fields that are
// marked with the "debug_redact" option.
func Redact(message proto.Message) {
	redact(message.ProtoReflect())
}

func redact(msgReflect protoreflect.Message) {
	msgReflect.Range(func(field protoreflect.FieldDescriptor, value protoreflect.Value) bool {
		// Clear the field if it's redacted
		if field.Options().(*descriptorpb.FieldOptions).GetDebugRedact() {
			msgReflect.Clear(field)
			return true
		}

		// If keeping the field, we need to recurse into any nested messages
		// to clear any redacted fields therein.
		switch {
		case field.IsMap() && isMessageKind(field.MapValue().Kind()):
			// map where values are messages
			value.Map().Range(func(mapKey protoreflect.MapKey, mapValue protoreflect.Value) bool {
				redact(mapValue.Message())
				return true
			})
		case field.IsList() && isMessageKind(field.Kind()):
			// list of messages
			list := value.List()
			for i := 0; i < list.Len(); i++ {
				redact(list.Get(i).Message())
			}
		case isMessageKind(field.Kind()):
			// singular message
			redact(value.Message())
		}
		return true
	})
}

func isMessageKind(kind protoreflect.Kind) bool {
	return kind == protoreflect.MessageKind || kind == protoreflect.GroupKind
	// NOTE: Groups are a legacy feature of proto2. A group field
	// behaves semantically just like a message field, but it has
	// a special encoding in the binary format.
}

Dynamic messaging

Dynamic messages are used to interact with message data for types that are not known at compile time. The typical flow for using Protobuf messages is to generate code for a particular message type and then use that generated code to interact with message data (like reading and validating or writing serialization formats). That obviously requires knowledge of the message type ahead of time.

But for generic tooling like a dynamic proxy, where the set of message types that may need to be processed isn't known ahead of time, we have to use dynamic messages.

The basic outline for such a generic message processing tool follows:

  1. The tool must be given the schema for the message to be processed. For the simplest case, let’s say we are building a command-line tool. So we can have the user tell the tool the fully-qualified name of the message to process. The user must also provide the descriptors which define the message for the schema (which can be produced by a compiler).
  2. The tool will need to process the given descriptors and find the descriptor for the requested message type. The descriptors are provided in the form of a set of FileDescriptorProto messages. So we have to convert these to “rich” FileDescriptor values and then use the given fully-qualified name to find the matching message descriptor.
  3. Now we get to the good part: once we have a message descriptor, we can create a dynamic message. In runtimes that support dynamic messages, this is usually a straightforward function or constructor that accepts a message descriptor and returns a message. The returned message acts like any other message, so we can use it to de-serialize data.
  4. For our command-line tool, let’s say we’re going to read binary-encoded message data from stdin and then print it in the form of human-readable text to stdout. This involves reading data from stdin, using the Protobuf runtime library to unmarshal that data into the dynamic message we created in the previous step, marshalling that message using the Text Format, and then printing the results to stdout.

Here’s an example Go program that demonstrates each of these steps:

package main

import (
	"fmt"
	"io"
	"log"
	"os"

	"google.golang.org/protobuf/encoding/prototext"
	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/reflect/protodesc"
	"google.golang.org/protobuf/reflect/protoreflect"
	"google.golang.org/protobuf/reflect/protoregistry"
	"google.golang.org/protobuf/types/descriptorpb"
	"google.golang.org/protobuf/types/dynamicpb"
)

func main() {
	if len(os.Args) != 3 {
		log.Fatalf("%s: exactly two arguments expected (descriptor set file and message type) but instead got %d\n", os.Args[0], len(os.Args)-1)
	}
	fileDescriptorSet := os.Args[1]
	messageType := protoreflect.FullName(os.Args[2])
	if !messageType.IsValid() {
		log.Fatalf("message type %q is not a valid fully-qualified type name\n", messageType)
	}

	// Read descriptors from file
	var files descriptorpb.FileDescriptorSet
	data, err := os.ReadFile(fileDescriptorSet)
	if err != nil {
		log.Fatalln(err)
	}
	if err := proto.Unmarshal(data, &files); err != nil {
		log.Fatalf("failed to process descriptors in %s: %v\n", fileDescriptorSet, err)
	}

	// Process descriptors from Protobuf into their runtime representation
	var registry protoregistry.Files
	for _, file := range files.File {
		fileDescriptor, err := protodesc.NewFile(file, &registry)
		if err != nil {
			log.Fatalf("failed to process %q: %v\n", file.GetName(), err)
		}
		if err := registry.RegisterFile(fileDescriptor); err != nil {
			log.Fatalf("failed to process %q: %v\n", file.GetName(), err)
		}
	}

	// Get descriptor for message type
	descriptor, err := registry.FindDescriptorByName(messageType)
	if err != nil {
		log.Fatalf("failed to find message type %q in given descriptors: %v\n", messageType, err)
	}
	messageDescriptor, ok := descriptor.(protoreflect.MessageDescriptor)
	if !ok {
		log.Fatalf("element named %q is not a message (%T)\n", messageType, descriptor)
	}

	// Now we can create a dynamic message and use that to read the binary format from stdin
	messageData, err := io.ReadAll(os.Stdin)
	if err != nil {
		log.Fatalf("failed to read message data from stdin: %v\n", err)
	}
	message := dynamicpb.NewMessage(messageDescriptor)
	if err := proto.Unmarshal(messageData, message); err != nil {
		log.Fatalf("failed to process input data for message type %q: %v\n", messageType, err)
	}

	// And write text format to stdout
	_, _ = fmt.Print(prototext.Format(message))
}

Code generation

The plugin protocol, for implementing custom code generation with Protobuf, is built on file descriptors: a plugin process reads a serialized CodeGeneratorRequest from its stdin, and that message includes FileDescriptorProto instances for the files being generated (as well as all of their imports).

So one could use similar techniques to the code sample above for processing FileDescriptorProto instances into richer descriptors, to make working with the schema easier.

Luckily, it’s not necessary to get into such low-level details if you are writing a plugin in C++ or Go. These runtimes provide some library support to help with authoring plugins.

C++

The C++ runtime library includes helpers for implementing code generation plugins. Simply create a sub-class of CodeGenerator that overrides the pure virtual Generate method. Then create a main function for your program that calls PluginMain, like so:

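#include <google/protobuf/compiler/plugin.h>

// MyCodeGenerator is your sub-class of google::protobuf::compiler::CodeGenerator.
int main(int argc, char* argv[]) {
  MyCodeGenerator generator;
  return google::protobuf::compiler::PluginMain(argc, argv, &generator);
}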

Your generator is provided the FileDescriptor* for which code should be generated.

The GeneratorContext* passed to your generator has methods you can use to create the output files into which you write the generated code contents. It offers both Open, for creating generated files, and OpenForInsert, for generating content to insert into another generated file.

Note that you must use these methods—creating files directly on the file system isn't allowed.

Use of insertions and insertion points isn't supported by all plugins. But for plugins that do support insertions, you’ll see markers in their generated output that look like @@protoc_insertion_point(NAME). These are places in the generated code into which another plugin can inject more code. This allows one plugin to augment the code generated by another plugin.
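The effect of an insertion can be pictured with a small, self-contained sketch. This is an illustration of the idea only, not the compiler's implementation: the function insertAtPoint is hypothetical, and reusing the marker line's indentation is a choice made for this example. It places the new content immediately above the marker line and keeps the marker, so that later insertions can target the same point.

```go
package main

import "strings"

// insertAtPoint returns generated with content spliced in immediately above
// the line that carries the marker for the named insertion point. The marker
// line itself is kept so that later insertions can target the same point.
// The second result reports whether the marker was found.
// (Hypothetical helper for illustration only.)
func insertAtPoint(generated, name, content string) (string, bool) {
	marker := "@@protoc_insertion_point(" + name + ")"
	lines := strings.SplitAfter(generated, "\n")
	for i, line := range lines {
		if !strings.Contains(line, marker) {
			continue
		}
		// Reuse the marker line's leading whitespace for the inserted lines.
		indent := line[:len(line)-len(strings.TrimLeft(line, " \t"))]
		var b strings.Builder
		b.WriteString(strings.Join(lines[:i], ""))
		for _, l := range strings.Split(strings.TrimSuffix(content, "\n"), "\n") {
			b.WriteString(indent + l + "\n")
		}
		b.WriteString(strings.Join(lines[i:], ""))
		return b.String(), true
	}
	return generated, false
}
```

For example, inserting "void Extra();" at the point named class_scope:Foo in a class body would place that declaration on its own line directly above the marker comment, leaving the rest of the generated file untouched.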

The C++ runtime also includes a few helper functions to aid in writing plugins for a handful of languages. These functions can be useful if the code you are generating needs to import code generated by one of the core generators and/or refer to generated types and symbols. These helpers are available for C#, Java, and Objective-C.

Go

The Go runtime library includes a package to help implement code generation plugins. Simply create a function that accepts a *protogen.Plugin and drop the following in your main function:

package main

import "google.golang.org/protobuf/compiler/protogen"

func main() {
	protogen.Options{}.Run(func(plugin *protogen.Plugin) error {
		// ... generate code ...
		return nil
	})
}

The *protogen.Plugin passed to your function indicates the source files for which code should be generated. These are in the form of protogen.File values. This type wraps a protoreflect.FileDescriptor and provides a parallel structure for accessing other wrapped descriptors (such as protogen.Message and protogen.Service). These wrappers provide additional metadata about each element to aid with generating Go code.

The *protogen.Plugin also provides a NewGeneratedFile method for creating output files. The *GeneratedFile type implements io.Writer, for writing generated code contents, but also has several other functions to aid in the generation of Go code.

Note that the Go runtime library does not support insertion points. Also note that you must use NewGeneratedFile to create output; creating files directly on the file system isn't allowed.