Hacker News
by cm2187 795 days ago
Like Parquet, this isn't really meant for RDBMS-style databases; it's more for analytics over large datasets. I work in an environment where we typically have tables with over 300 columns and tens, if not hundreds, of millions of rows daily. When you want to do a simple sum/group by involving 2 or 3 columns, it is great to have a column-store file format, where you only read the columns you need and those are compressed.
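
To make the "only read the columns you need" point concrete, here is a rough sketch in plain Python. It is a toy layout, not Parquet or ZSV: each column is stored as its own zlib-compressed blob, and the helper names (`write_columnar`, `read_column`, `sum_group_by`) and the sample rows are made up for illustration.

```python
import json
import zlib
from collections import defaultdict


def write_columnar(rows, column_names):
    """Store each column as its own compressed blob (toy layout, not Parquet)."""
    return {
        name: zlib.compress(json.dumps([row[name] for row in rows]).encode())
        for name in column_names
    }


def read_column(store, name):
    """Decompress and decode a single column, leaving the others untouched."""
    return json.loads(zlib.decompress(store[name]))


def sum_group_by(store, key_column, value_column):
    """Roughly SELECT key, SUM(value) GROUP BY key, touching only two columns."""
    totals = defaultdict(int)
    keys = read_column(store, key_column)
    values = read_column(store, value_column)
    for key, value in zip(keys, values):
        totals[key] += value
    return dict(totals)


# Hypothetical sample data: a wide table where the query needs only 2 columns.
rows = [
    {"region": "EU", "desk": "rates", "notional": 10, "comment": "x" * 50},
    {"region": "US", "desk": "fx", "notional": 5, "comment": "y" * 50},
    {"region": "EU", "desk": "fx", "notional": 7, "comment": "z" * 50},
]
store = write_columnar(rows, ["region", "desk", "notional", "comment"])
result = sum_group_by(store, "region", "notional")
print(result)
```

The "desk" and "comment" blobs are never decompressed here, which is the whole benefit; with 300 columns the share of untouched data would be much larger.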

The price you pay is that it is inefficient for single-record access, or for "select *" kinds of queries.

2 comments

I was comparing it with Parquet, which is much more complex, but has features that help you access the data in less than O(n), like row groups and pages.
You mentioned NLJSON and CSV, which would require reading all columns from disk.
Yes, but you would usually have to read at least two columns anyway. What are the datasets that are too large to be ingested completely, but too small for a proper columnar format?

If ZSV is meant to occupy the gap between CSV/NLJSON (smaller datasets) and Parquet/DuckDB (larger datasets), this niche is actually really small, if not nonexistent.

Yes, it's unclear to me what the advantage is over Parquet with compression. And there are enough file formats flying around already.
Even with an OLAP use case, you're most often not scanning every row in the database if you have even a single where clause or conditional filter, which is almost always the case. You need some level of locality, and if your format doesn't support that, that alone will be enough to kill performance.

Also, Parquet has lots of features that'll get you to the general vicinity of a single record tolerably fast without sacrificing much in terms of storage or computational complexity. It's a small price for a big win.
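
A rough sketch of the locality idea, in plain Python: rows are split into groups, each group keeps min/max statistics for a key, and a lookup skips every group whose range cannot contain the key. This only illustrates the general row-group technique; it is not Parquet's actual on-disk format or API, and `RowGroup`, `build_row_groups`, and `lookup` are names invented for the example.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    min_key: int
    max_key: int
    rows: list  # list of (key, value) tuples


def build_row_groups(rows, group_size):
    """Split rows into fixed-size groups and record min/max key per group."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        keys = [key for key, _ in chunk]
        groups.append(RowGroup(min(keys), max(keys), chunk))
    return groups


def lookup(groups, key):
    """Return matching values and how many rows were actually scanned."""
    scanned = 0
    matches = []
    for group in groups:
        # Skip the group entirely if the key is outside its min/max range.
        if key < group.min_key or key > group.max_key:
            continue
        scanned += len(group.rows)
        matches.extend(value for k, value in group.rows if k == key)
    return matches, scanned


# Hypothetical data, sorted by key so the min/max ranges do not overlap.
rows = [(i, f"record-{i}") for i in range(1000)]
groups = build_row_groups(rows, group_size=100)
matches, scanned = lookup(groups, 742)
print(matches, scanned)
```

The pruning only helps when the data is sorted or clustered on the filter key; with randomly ordered keys nearly every group's range would cover the lookup value and nothing would be skipped.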