If you wish to travel far and fast, travel light. — Cesare Pavese
TL;DR: The first 15 field numbers are special: most runtimes will decode them much faster than the other field numbers. When designing a message type for decoding performance, it’s good to use these field numbers on fields that are almost always present.
Did you know that not all field numbers are created equal? The number you attach to a field actually affects how quick it is to decode in some runtimes, and can sometimes make a difference for messages that you have to parse a lot of.
This tip is less about specific guidance; you’ll almost never be bothered by this level of optimization. But it’s a good opportunity to understand some of the more subtle performance characteristics of Protobuf decoding.
On the wire, a Protobuf message is a sequence of records, back to back. The first part of each record is called the tag, and it contains both the field number and the wire type of the record. This is what allows Protobuf parsers to skip fields they don’t recognize.
The tag is a 32-bit integer, where the 3 low bits are the wire type (of which there are eight, two of which are unused) and the remaining 29 bits are the field number (this is why field numbers can’t be greater than 2^29-1). This tag is encoded as a variable width integer, or varint.
The Protobuf varint is simple: the 32-bit integer is broken up into five 7-bit chunks (the highest three bits of the last chunk are always zero, because 7 does not cleanly divide 32). Then, we encode each 7-bit chunk as a byte, where the remaining eighth bit is set if there is another 7-bit chunk remaining; this is the continuation bit. Any zero chunks after the last nonzero chunk are discarded.
For example, all of the 32-bit integers up to 127 are encoded as a single byte, equal to truncating that integer down to 8 bits. But 128’s binary representation is 0b1000_0000
, meaning that its second 7-bit chunk is 0b000_0001
. That means we need to encode it as two bytes, 0x0180
.
The two-byte varints go up to 2^14, or 16384. The three byte varints go up to 2^21, the four byte varints up to 2^28, and the remaining 32-bit integers take up 5 bytes. (Exercise: how many bytes does the largest 64-bit integer need to be encoded as a varint, and what are the thresholds for each byte size for 64-bit varints?)
There are many techniques for decoding varints, but most runtimes assume that small varints are more common, and so have explicit special cases for the one and two byte varints. Often decoding a varint only takes a handful of instructions, so a special case can avoid the complicated setup of looping over all of the 7-byte chunks.
In C, that might look like this:
// Parses a varint and advances p past it. Assumes p is followed
// by at least 10 bytes.
uint64_t varint(uint8_t* p) {
// Fastest path for one byte.
uint8_t x = *b++;
uint64_t v = (uint64_t)x;
if x & 0x80 == 0 {
return v;
}
// Faster path for two bytes.
x = *b++;
v |= (uint64_t)x << 7;
if x & 0x80 == 0 {
return v;
}
// Loop fallback.
for (int i = 2; i < 9; i++) {
x = *b++;
v |= (uint64_t)x << (i * 7);
if x & 0x80 == 0 {
return v;
}
}
// Special case for the 10th byte. Why do we need this?
x = *b++;
v |= (uint64_t)x << 63;
if x > 1 {
return v;
}
fail(); // Handle the error condition.
}
If we really care about performance, we might want all of our field numbers to hit that fastest case. Unfortunately, very few field numbers will. You might think that’s the first 127 field numbers, but a full three bytes are spent on the wire type, leaving us with only the field numbers from 1 to 15 as being encoded with one byte (0 is not a valid field number so it doesn’t count).
There are a lot more two-byte fields: all of the field numbers between 16 and 2048 take up the same number of bytes on the wire. It’s quite rare than field numbers (except for extensions) go into the three digits, so you’ll find that only the first 15 are special.
Probably not! Most message types are quite small, so you can fit all the fields in the 15 “special fast fields.” However, it’s worth keeping in mind for messages that will be part of large repeated fields which will only carry a small subset of all their possible fields. This isn’t rare, but it’s certainly uncommon.
The low field numbers are a good example of why understanding the wire format can reveal surprising consequences for choices made when designing a message.