by sigil 3612 days ago

I'll give you an example.

I used protobuf as the output format for a web crawler. Workers read URLs and sequentially write entire HTTP responses to disk. [0] Sure, you could serialize the responses to JSON, but the overhead of representing things like binary image data as escaped Unicode strings was prohibitive in my case.
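To make the "sequentially write to disk" part concrete, here is a rough sketch of one common way to do it: prefix each serialized message with its length as a varint. This is an assumption on my part, not necessarily how the crawler above framed its records, and the function names (`write_record`, `read_records`, `json_size`) are made up for illustration. The payloads are treated as opaque bytes standing in for serialized protobuf messages. `json_size` shows roughly what the same bytes cost as an escaped JSON string.

```python
import io
import json


def encode_varint(n):
    """Encode a non-negative int as a base-128 varint (protobuf style)."""
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)  # more bytes follow
        else:
            out.append(low)
            return bytes(out)


def read_varint(stream):
    """Read one varint; return None on a clean end of stream."""
    result = 0
    shift = 0
    while True:
        byte = stream.read(1)
        if not byte:
            if shift == 0:
                return None
            raise EOFError("truncated varint")
        result |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            return result
        shift += 7


def write_record(stream, payload):
    """Append one length-prefixed record (hypothetical framing)."""
    stream.write(encode_varint(len(payload)))
    stream.write(payload)


def read_records(stream):
    """Yield each record's payload in the order it was written."""
    while True:
        length = read_varint(stream)
        if length is None:
            return
        payload = stream.read(length)
        if len(payload) != length:
            raise EOFError("truncated record")
        yield payload


def json_size(payload):
    """Size of the payload as a JSON string with \\uXXXX escapes."""
    return len(json.dumps(payload.decode("latin-1")))


if __name__ == "__main__":
    body = bytes(range(256))  # stand-in for binary image data
    out = io.BytesIO()
    write_record(out, body)
    print("framed bytes:", len(out.getvalue()))
    print("JSON bytes:  ", json_size(body))
```

The framed record costs the payload plus a byte or two of length prefix, while the JSON string pays several characters for every byte outside printable ASCII.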

"Why not BSON?" Well, schemas can be nice when performance matters. Instead of solving a parsing problem at runtime, a C/C++ reader can contain a compiler-optimized deserializer for a given protobuf schema. It's almost like directly reading and writing an array of C structs, except protobuf is architecture-independent, and you can add new fields without breaking old readers.
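The "add new fields without breaking old readers" part falls out of the wire format: every field carries a tag of `(field_number << 3) | wire_type`, so a reader can skip fields it doesn't know about. Below is a hand-rolled sketch of that idea in plain Python, not the real protobuf library. The schema (fields 1 to 3, plus a later field 4) and all names are invented for illustration, and it only handles non-negative ints and bytes.

```python
def encode_varint(n):
    """Encode a non-negative int as a base-128 varint."""
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)
        else:
            out.append(low)
            return bytes(out)


def decode_varint(buf, pos):
    """Decode a varint from buf at pos; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7


def encode_field(number, value):
    """Encode one field: ints as wire type 0, bytes as wire type 2."""
    if isinstance(value, int):
        return encode_varint((number << 3) | 0) + encode_varint(value)
    return encode_varint((number << 3) | 2) + encode_varint(len(value)) + value


# Hypothetical schema known to the *old* reader.
OLD_SCHEMA = {1: "url", 2: "status", 3: "body"}


def decode(buf, schema):
    """Decode known fields; silently skip any field number not in schema."""
    msg = {}
    pos = 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        number, wire_type = tag >> 3, tag & 7
        if wire_type == 0:      # varint
            value, pos = decode_varint(buf, pos)
        elif wire_type == 2:    # length-delimited
            length, pos = decode_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        elif wire_type == 1:    # fixed 64-bit
            value = buf[pos:pos + 8]
            pos += 8
        elif wire_type == 5:    # fixed 32-bit
            value = buf[pos:pos + 4]
            pos += 4
        else:
            raise ValueError("unsupported wire type %d" % wire_type)
        if number in schema:
            msg[schema[number]] = value
    return msg


if __name__ == "__main__":
    # A newer writer adds field 4 (say, fetch time in ms).
    record = (encode_field(1, b"http://example.com/")
              + encode_field(2, 200)
              + encode_field(3, b"\x89PNG")
              + encode_field(4, 37))
    print(decode(record, OLD_SCHEMA))
```

A generated C/C++ deserializer does the same dispatch, except the per-field branches are fixed at compile time for the given schema instead of being looked up in a dict at runtime.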

There are plenty of reasons not to use protobuf. I particularly disliked the code generation step for C/C++. Codegen makes even less sense in a language like Python, and yet that's exactly what the official Python protobuf implementation from Google does (did?). I wrote a Python protobuf library on top of a C protobuf library that avoids codegen: https://github.com/acg/lwpb

[0] See the ARC format used by the Internet Archive for a similar (and imo clunkier) solution. http://crawler.archive.org/articles/developer_manual/arcs.ht...