Imagine working on a team that wants to move quickly, but whose output is both a product and an API consumed by multiple other teams. The product you are building uses that API, but so do other teams. Your code needs to be stable enough to support these other teams' needs (an API that doesn't change under them), but you also want to change your own application quickly, which means changing the API regularly.
A reasonable move is to version the API and have an ops team ensure that all in-use versions stay running. Some consumers, your own team's application for example, will be on the bleeding edge, while others will lag behind.
Using proto* in this case is a reasonable move because you gain multiple benefits, performance being perhaps the least important in this case. Having a defined schema for your API provides some level of natural documentation for the API. Code generation allows your team to publish trusted client libraries for multiple languages.
I'll specifically call out client libraries, since I've seen them make a dramatic difference in organizational efficiency, mostly through team-to-team trust. Without a client library, testing becomes a significant burden (read up on contract testing). When the team that publishes an API also creates the client that most directly calls it, the client library becomes the testing surface, instead of every consumer needing to test the API itself for regressions.
We use them internally at Square for our RPC mechanism ("Sake", similar to "Stubby", Google's internal RPC mechanism), for our Kafka-based logging/metrics/queue infrastructure, and for defining external JSON APIs. We're in the process of switching from Sake to gRPC, which also uses Protobufs as its payload format (although you can sub in different transports).
> Protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages.
Yes, I read this. It tells me what Protocol Buffers are: faster, smaller, XML-like data structures for serialisation. What are the most common use cases, though? And do people only use them for performance reasons?
The most common use cases line up with those of JSON: communication between programs that don't share an address space. The main advantage over JSON (in my opinion) is the definition of an explicit schema. The second (and also important) advantage is in the efficient size of the serialized data, which limits memory, disk, and bandwidth usage. Another (less important to me) advantage is in serialization and deserialization efficiency. A disadvantage is that it requires deserialization for human inspection - that is, it isn't plain text like JSON or XML.
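To make the size point concrete, here is a minimal sketch that hand-encodes one small message in the protobuf wire format and compares it with compact JSON. The `User` message, its field numbers, and the helper names are all made up for illustration; real code would use a generated class rather than encoding by hand.

```python
import json


def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_key(field_number, wire_type):
    """A field key is (field_number << 3) | wire_type, itself a varint."""
    return encode_varint((field_number << 3) | wire_type)


def encode_uint_field(field_number, value):
    # Wire type 0: varint
    return encode_key(field_number, 0) + encode_varint(value)


def encode_string_field(field_number, text):
    # Wire type 2: length-delimited (length prefix, then raw UTF-8 bytes)
    data = text.encode("utf-8")
    return encode_key(field_number, 2) + encode_varint(len(data)) + data


# Hypothetical schema:
#   message User { uint64 id = 1; string name = 2; uint32 age = 3; }
user = {"id": 150, "name": "alice", "age": 30}

wire = (
    encode_uint_field(1, user["id"])
    + encode_string_field(2, user["name"])
    + encode_uint_field(3, user["age"])
)
as_json = json.dumps(user, separators=(",", ":")).encode("utf-8")

print(len(wire), len(as_json))  # 12 34
```

The field names never appear on the wire, only the field numbers from the schema, which is where most of the saving comes from on a message this small.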
It is similar to Apache Thrift, if you're looking for a non-Google project with similar ideas.
Serialization and deserialization efficiency is especially important for mobile apps, where the CPU time spent parsing/serializing JSON (or gzipped JSON) can become very prominent.
Apache Thrift, IIRC, is actually a reimplementation of protos, in the same way that Facebook's Buck is of Google's Bazel.
I have sometimes looked at "raw" binary protos to inspect the string fields, which happen(ed?) to be byte-aligned and so readable in a text editor. Not sure off the top of my head if that's always the case.
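As far as I know this holds for the standard wire format: it is byte-oriented, and a string field is stored as a length prefix followed by the UTF-8 bytes verbatim, so the text survives unless the whole stream is compressed or encrypted. A rough sketch, using hand-written wire bytes for a hypothetical message (field 1 = 150, field 2 = "alice", field 3 = 30) and a made-up helper that mimics the Unix `strings` tool:

```python
import re

# Hand-written wire bytes for a hypothetical message:
#   08 96 01             field 1, varint 150
#   12 05 61 6c 69 63 65 field 2, length 5, "alice"
#   18 1e                field 3, varint 30
raw = bytes.fromhex("089601" "1205616c696365" "181e")

# Runs of at least four printable ASCII bytes
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def printable_runs(data):
    """Return printable ASCII runs found in data, like `strings` would."""
    return [match.decode("ascii") for match in PRINTABLE_RUN.findall(data)]


print(printable_runs(raw))  # ['alice']
```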
Performance is a nice benefit, but the standardization of message passing is by far the biggest benefit in my opinion. Within a given language, I know that any API I call will have certain unvarying semantics, I can see a highly readable yet formal spec of the data being exchanged, and the code to manipulate these messages will always be familiar and idiomatic.
Duplicating these benefits with XML or JSON would require defining your own grammar and parser, but wouldn't have the performance benefits. Recreating the performance gains would require a new serialization scheme, at which point you'd have broken from the standard JSON and XML tools and recreated protobufs in everything but the proto definition language; at that point, why not create a DSL rather than bolting this functionality onto an existing one?
In addition to smaller/faster than XML, protobufs make it extremely easy to declare the schema of data, validate data and version your schema. Then the generated wrappers and static type checking in various languages add additional guarantees that you're using the data correctly.
Plain XML still requires a lot of work to ensure compatibility when it's used across multiple places; protobufs attempt to minimize many sources of incompatibility.
Add in a bunch of tools such as protobuf->JSON, protobuf plaintext serialization, etc., and it becomes more difficult to argue for using something such as XML or vanilla JSON.
Flatbuffers are still a nice solution for more performance-critical applications.
Yes, I think you are just using a different sense of "based on" than I am. gRPC is based on Stubby in the sense that it is influenced by the design of Stubby and uses the knowledge learned from creating Stubby.
I used protobuf as the output format for a web crawler. Workers read URLs and sequentially write entire HTTP responses to disk. [0] Sure, you could serialize the responses to JSON, but the overhead of representing things like binary image data as escaped unicode strings was prohibitive in my case.
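A rough illustration of that overhead, not the crawler's actual code: the sketch below takes a synthetic 25,600-byte body and compares a JSON string with escapes, a base64 string inside JSON (a common workaround, swapped in here for comparison), and the size of a protobuf `bytes` field, which is one key byte plus a varint length plus the raw payload. The field number and helper name are assumptions.

```python
import base64
import json

# Synthetic stand-in for a binary HTTP response body (every byte value, repeated)
body = bytes(range(256)) * 100


def varint_size(n):
    """Number of bytes a non-negative integer occupies as a varint."""
    size = 1
    while n >= 0x80:
        n >>= 7
        size += 1
    return size


# Protobuf `bytes` field with a field number below 16: 1 key byte + length + payload
framed_size = 1 + varint_size(len(body)) + len(body)

# JSON option 1: map each byte to a code point and let json escape it
escaped_json = json.dumps({"body": body.decode("latin-1")}).encode("ascii")

# JSON option 2: base64 inside a JSON string
b64_json = json.dumps({"body": base64.b64encode(body).decode("ascii")}).encode("ascii")

print("raw payload :", len(body))
print("protobuf    :", framed_size)
print("JSON base64 :", len(b64_json))
print("JSON escaped:", len(escaped_json))
```

Base64 costs about a third extra, and the escaped form is several times the raw size for data like this, while the protobuf framing adds only a handful of bytes.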
"Why not BSON?" Well, schemas can be nice when performance matters. Instead of solving a parsing problem at runtime, a C/C++ reader can contain a compiler-optimized deserializer for a given protobuf schema. It's almost like directly reading and writing an array of C structs, except protobuf is architecture-independent, and you can add new fields without breaking old readers.
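The "add new fields without breaking old readers" part falls out of the wire format: every field carries its number and wire type, so a reader can step over fields it doesn't know. Here is a minimal sketch of that idea with a hand-rolled decoder. The field layout (`url`, `status`, `fetched_ms`) and function names are hypothetical, and this version simply drops unknown fields rather than preserving them.

```python
def read_varint(buf, pos):
    """Decode a varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_known(buf, known_fields):
    """Decode fields listed in known_fields (number -> name); skip the rest."""
    out = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:    # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:  # fixed 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError("unsupported wire type %d" % wire_type)
        if field_number in known_fields:
            out[known_fields[field_number]] = value
    return out


# Message written by a newer writer: url = "a.b", status = 200,
# plus a field 3 (fetched_ms = 1000) that older readers have never heard of.
v2_message = bytes.fromhex("0a03612e62" "10c801" "18e807")

old_reader = {1: "url", 2: "status"}
new_reader = {1: "url", 2: "status", 3: "fetched_ms"}

print(parse_known(v2_message, old_reader))
print(parse_known(v2_message, new_reader))
```

The old reader still gets `url` and `status` and never notices field 3; the new reader sees all three.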
There are plenty of reasons to not use protobuf. I particularly disliked the code generation step for C/C++. That makes even less sense in a language like Python, and yet that's exactly what the official Python protobuf implementation from Google does (did?). I wrote a Python protobuf library on top of a C protobuf library that avoids codegen: https://github.com/acg/lwpb
For me there are three main advantages: schema, performance and code generation.
Having a strict schema makes it a lot easier to maintain applications in a distributed system. Parsing protobuf is much faster than something like JSON. The multitude of code generators for protobuf make it really simple and easy to use multiple languages on the same data structures.
I used it in a trading system because it's a compact scheme for sending data across networks. It's also quite fast, and there's support for various languages. So you can have a feed handler blasting out prices using a C++ implementation, with a GUI written in C# drawing a chart.
Serializing data for RPC, network protocols or storage, description and serialization of configuration, serializable state, serializing complex types for cryptographic signing, etc.
Why is it useful? The schema both documents the data structure and allows mappings to natural APIs in many different languages. Parsers and encoders are generated for you, and are fast.
At Badoo we use them to have a unified API for all of our platforms (Web, Mobile Web, Android, iOS, Windows Phone etc). This would not have been possible without something like ProtoBuf.