Hacker News
by aeroevan 2624 days ago
It's 100GB compressed. Parquet does a very good job of compressing most data, so that's where the 10x estimate (so 1TB uncompressed) came from, as a rule of thumb.

Parquet also supports much better access mechanisms, like being able to deserialize a single column without having to read in entire rows.
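To make the single-column point concrete, here is a minimal sketch of a toy columnar layout. It is not Parquet's actual file format: the footer layout and the names `write_columnar` and `read_column` are made up for illustration, and there is no compression or encoding. It only shows the idea of storing each column contiguously with an index, so a reader can seek straight to one column.

```python
import io
import struct

def write_columnar(buf, columns):
    """Write each column contiguously, then a footer of (offset, count) pairs."""
    index = []
    for values in columns:
        index.append((buf.tell(), len(values)))
        buf.write(struct.pack(f"<{len(values)}q", *values))
    footer_start = buf.tell()
    for offset, count in index:
        buf.write(struct.pack("<qq", offset, count))
    # Trailer: where the footer starts and how many columns it describes.
    buf.write(struct.pack("<qq", footer_start, len(index)))

def read_column(buf, col):
    """Read one column by seeking to it; returns (values, bytes_read)."""
    buf.seek(-16, io.SEEK_END)
    footer_start, ncols = struct.unpack("<qq", buf.read(16))
    if not 0 <= col < ncols:
        raise IndexError(f"column {col} out of range")
    buf.seek(footer_start + 16 * col)
    offset, count = struct.unpack("<qq", buf.read(16))
    buf.seek(offset)
    data = buf.read(8 * count)
    # Bytes read: trailer + one footer entry + the column itself.
    return list(struct.unpack(f"<{count}q", data)), 16 + 16 + len(data)

# Hypothetical data: three int64 columns of four rows each.
buf = io.BytesIO()
write_columnar(buf, [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]])
values, bytes_read = read_column(buf, 1)
total_size = len(buf.getvalue())
```

In this toy case the reader touches 64 of the file's 160 bytes to get the middle column. With a row-oriented layout it would have to scan every row to pull out the same values.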

But like you mentioned, 1TB of data in a traditional database isn't that bad.

2 comments

... also remembering that a traditional DB will typically not store data raw. Row compression is normal and disk compression is normal. The typical column store advantages are block compression, predicate pushdown and column-order storage.
Regular databases such as SQL Server and Oracle have had columnar compression built in as an option along with the row stores for years now. I use it in SQL Server a lot and it works great.
You can run a SQL DB over a compressed filesystem, and some DBs allow you to compress tables too.

> like being able to deserialize a single column without having to read in entire rows.

And it reads the filesystem's whole page anyway.

Sorry for the late reply, but Parquet is a columnar format, so if the data is big enough, you should have multiple pages/blocks of data in a single column for a specific row group, and then be able to seek to the next row group and sequentially read the next set of blocks.
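A rough back-of-the-envelope sketch of the page argument, under simplifying assumptions that are mine and not from the Parquet spec: a 4 KiB filesystem page, row groups laid out back to back, and every column chunk the same size. The helper names `pages_touched` and `column_chunk_ranges` are hypothetical.

```python
PAGE = 4096  # assumed filesystem page size in bytes

def pages_touched(ranges, page=PAGE):
    """Count distinct pages overlapped by a list of (offset, length) byte ranges."""
    pages = set()
    for offset, length in ranges:
        if length > 0:
            first = offset // page
            last = (offset + length - 1) // page
            pages.update(range(first, last + 1))
    return len(pages)

def column_chunk_ranges(n_row_groups, n_columns, chunk_bytes, column):
    """Byte ranges of one column's chunks across all row groups.

    Assumes each row group holds n_columns equally sized chunks, in column order.
    """
    group_bytes = n_columns * chunk_bytes
    return [(g * group_bytes + column * chunk_bytes, chunk_bytes)
            for g in range(n_row_groups)]

# Hypothetical large file: 4 row groups, 10 columns, 1 MiB per column chunk.
big = column_chunk_ranges(n_row_groups=4, n_columns=10, chunk_bytes=1 << 20, column=3)
big_one_column = pages_touched(big)
big_whole_file = pages_touched([(0, 4 * 10 * (1 << 20))])

# Hypothetical tiny file: same shape, but only 100 bytes per column chunk.
small = column_chunk_ranges(n_row_groups=4, n_columns=10, chunk_bytes=100, column=3)
small_one_column = pages_touched(small)
small_whole_file = pages_touched([(0, 4 * 10 * 100)])
```

With these numbers, reading one column of the large file touches 1024 pages out of 10240, while in the tiny file everything sits in a single page, so the whole page gets read either way. That is the "big enough data" caveat: the page-granularity objection only bites when column chunks are smaller than a page.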