Hacker News new | ask | show | jobs
by rcoveson 1968 days ago
Better to compare it to Cap'n Proto instead. Arrow data is already laid out in a usable way. For example, an Arrow column of int64s is an 8-byte aligned memory region of size 8*N bytes (plus a bit vector for nullity), ready for random access or vectorized operations.

Protobuf, on the other hand, would encode those values as variable-width integers. This saves a lot of space, which might be better for transfer over a network, but means that writers have to take a usable in-memory array and serialize it, and readers have to do the reverse on their end.

Think of Arrow as standardized shared memory using struct-of-arrays layout, Cap'n Proto as standardized shared memory using array-of-structs layout, and Protobuf as a lightweight purpose-built compression algorithm for structs.

2 comments

> Think of Arrow as standardized shared memory using struct-of-arrays layout, Cap'n Proto as standardized shared memory using array-of-structs layout

I just want to say thank you for this part of the sentence. I understand struct-of-arrays vs array-of-structs, and now I finally understand what the heck Arrow is.

Protobuf provides the fixed64 type and when combined with `packed` (the default in proto3, optional in proto2) gives you a linear layout of fixed-size values. You would not get natural alignment from protobuf's wire format if you read it from an arbitrary disk or net buffer; to get alignment you'd need to move or copy the vector. Protobuf's C++ generated code provides RepeatedField that behaves in most respects like std::vector, but in as much as protobuf is partly a wire format and partly a library, users are free to ignore the library and use whatever code is most convenient to their application.

TL;DR variable-width numbers in protobuf are optional.