
by makmanalp 795 days ago

The core idea with these compressed columnar "big data" formats is that they minimize storage accesses. Nothing else really matters as much. If you're gonna have to load a sizable chunk of the file to get to the bits you need, the format you store it in starts mattering less.

What this gets right: Part of the reason you want to store columns together is that similar values compress well, so you can reduce your IO: smaller files are faster to load into memory. However, in many cases (e.g. Arrow, Parquet) lightweight encodings are preferred here, e.g. run-length encoding (1,1,1,1,1,1,5,5,5,5,3,3,3 -> 6x1, 4x5, 3x3, written as count x value) or dictionary encoding (if your column is enum-like, you can store each enum value as a one-byte code), because they can be scanned without decoding, amplifying your savings.
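To make the "scanned without decoding" point concrete, here is a minimal Python sketch of both encodings. The function names are made up for illustration, and this is not how Arrow or Parquet lay out their bytes; it only shows that a count query can run over the runs directly.

```python
from itertools import groupby

def rle_encode(values):
    """Collapse runs of equal values into (count, value) pairs."""
    return [(len(list(run)), value) for value, run in groupby(values)]

def count_equal(runs, target):
    """Count occurrences of target by walking the runs, never expanding them."""
    return sum(count for count, value in runs if value == target)

def dict_encode(values):
    """Map each distinct value to a small integer code (first seen gets 0)."""
    codes = {}
    encoded = [codes.setdefault(v, len(codes)) for v in values]
    return codes, encoded

runs = rle_encode([1, 1, 1, 1, 1, 1, 5, 5, 5, 5, 3, 3, 3])
fives = count_equal(runs, 5)          # touches 3 runs instead of 13 values
codes, encoded = dict_encode(["red", "green", "red", "blue"])
```

With the sample column above, the scan looks at three pairs rather than thirteen values, and the gap grows with longer runs.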

What it misses on (IMHO):

- There's a metadata field but it doesn't contain any offsets to access a specific column quickly. So if you have 8 columns of 2GB each, to just get to the 7th column you have to read 12GB first, which is quite wasteful. If you store just an offset, you could be reading a handful of bytes. Massive savings.

- Within each column, how do you get to the range of values you want? Most columnar formats have stripes (i.e. stored in chunks of X rows each) which contain statistics (this stripe or range of values contains min value A, max value B) that allow you to skip chunks really fast. So again, within that 2GB you read not much more than you strictly have to.
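A rough Python sketch of both ideas, assuming a made-up layout: stripes of little-endian int64 values, then a JSON footer holding each stripe's offset, length, min and max, then 8 bytes giving the footer length. None of this is the format under discussion or any real format's layout; `write_table` and `read_range` are hypothetical names.

```python
import io
import json
import struct

def write_table(columns, stripe_rows=4):
    """Write each column as stripes, then a footer index, then the footer length."""
    buf = io.BytesIO()
    index = {}
    for name, values in columns.items():
        index[name] = []
        for i in range(0, len(values), stripe_rows):
            chunk = values[i:i + stripe_rows]
            data = struct.pack(f"<{len(chunk)}q", *chunk)
            index[name].append({"offset": buf.tell(), "length": len(data),
                                "min": min(chunk), "max": max(chunk)})
            buf.write(data)
    footer = json.dumps(index).encode()
    buf.write(footer)
    buf.write(struct.pack("<Q", len(footer)))
    return buf.getvalue()

def read_range(blob, column, lo, hi):
    """Return values of `column` in [lo, hi], reading only stripes that can match."""
    f = io.BytesIO(blob)
    f.seek(-8, io.SEEK_END)                      # last 8 bytes: footer length
    (footer_len,) = struct.unpack("<Q", f.read(8))
    f.seek(-8 - footer_len, io.SEEK_END)         # hop to the footer itself
    index = json.loads(f.read(footer_len))
    hits, stripes_read = [], 0
    for stripe in index[column]:
        if stripe["max"] < lo or stripe["min"] > hi:
            continue                             # stats rule this stripe out
        f.seek(stripe["offset"])                 # jump straight to the stripe
        data = f.read(stripe["length"])
        stripes_read += 1
        values = struct.unpack(f"<{len(data) // 8}q", data)
        hits.extend(v for v in values if lo <= v <= hi)
    return hits, stripes_read

blob = write_table({"a": list(range(12)), "b": list(range(100, 112))})
hits, stripes_read = read_range(blob, "b", 105, 106)
```

Here the reader touches the footer and one stripe of column "b"; column "a" and the other two stripes of "b" are never read. Min/max skipping only pays off when values are at least loosely clustered within stripes, as they are in this sorted toy data.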

If this reminds you of an on-disk tree where you first hop to a column and then hop to some specific stripes, yeah, that's pretty much the idea.

-----

Sidenote: I've generally concluded that "human readable" is only a virtue for encoding formats that aren't doing heavy lifting, like the API call your web app is sending to the backend. Even in that case, your HTTP request is wrapped in gzip, wrapped in TLS, wrapped in TCP and chunked to all hell. No one complains about the burden caused by those. So what's one more layer of decoding? We can just demand to have tools that are not terrible, and the result is pretty transparent to us. The format is mostly for the computer, not you.

When I hear about stuff like terabytes of JSON just being dumped into s3 buckets and then consumed again by some other worker I have a fit because it's so easy and cheap these days not to be that wasteful.

1 comments

hafthor 795 days ago

Thanks for taking a look. Regarding seekable columns, that's the reason why I use the ZIP file format. It has a central directory at the end of the ZIP file that records the location of each file inside, so you can seek straight to a specific column file and extract it.
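For anyone who wants to see the central directory doing that job, here is a small sketch with Python's standard `zipfile` module. The member names (`col0.txt` and so on) and the one-file-per-column layout are assumptions for the demo, not necessarily how the format names or arranges its entries.

```python
import io
import zipfile

# Build a small archive in memory with one member per column.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for i in range(8):
        rows = "\n".join(str(i * 100 + r) for r in range(5))
        zf.writestr(f"col{i}.txt", rows)

# Opening for read parses the central directory at the end of the archive.
with zipfile.ZipFile(buf) as zf:
    # header_offset is where each member's local header starts in the file.
    offsets = {info.filename: info.header_offset for info in zf.infolist()}
    # read() seeks to that one member and decompresses only it.
    col6 = zf.read("col6.txt").decode().split("\n")
```

Reading `col6.txt` does not decompress the six members stored before it; the directory entry gives the offset and the reader seeks there.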


makmanalp 795 days ago

Oh interesting - I missed that tidbit! So with that and row groups and the metadata (assuming you have one metadata block per column in each row group) I think you get to full seekability, right?
