by orthoxerox 795 days ago
It is simple, but how do you access the price in row #1234567890? If your data doesn't have this many records and can fit into RAM, a basic NLJSON or CSV will work just as well.

Like Parquet, this isn't really meant for RDBMS-style databases; it's more for analytics over large datasets. I work in an environment where we typically have tables with over 300 columns and tens, if not hundreds, of millions of rows daily. When you want to do a simple sum/group by involving 2 or 3 columns, it is great to have a column store file format where you only read the columns you need, and those are compressed.

The price you pay is that it is inefficient for single-record access, or for "select *" kinds of queries.
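To make that trade-off concrete, here is a minimal sketch of a toy column store. The layout (one zlib-compressed blob per column) and all function names are made up for illustration and are not the actual ZSV or Parquet encoding.

```python
import zlib

# Toy column store: each column is compressed separately.
# Illustrative layout only, not the real ZSV or Parquet format.
def write_columns(rows, names):
    store = {}
    for i, name in enumerate(names):
        text = "\n".join(str(row[i]) for row in rows)
        store[name] = zlib.compress(text.encode("utf-8"))
    return store

def read_column(store, name):
    return zlib.decompress(store[name]).decode("utf-8").split("\n")

def sum_by_group(store, key_col, value_col):
    # Only two columns are decompressed, however many the table has.
    totals = {}
    keys = read_column(store, key_col)
    values = read_column(store, value_col)
    for key, value in zip(keys, values):
        totals[key] = totals.get(key, 0.0) + float(value)
    return totals

def read_row(store, names, index):
    # Single-record access has to decompress every column.
    return [read_column(store, name)[index] for name in names]

names = ["region", "product", "price"]
rows = [("eu", "a", 1.5), ("us", "b", 2.0), ("eu", "c", 3.0)]
store = write_columns(rows, names)
totals = sum_by_group(store, "region", "price")
```

With 300 columns, `sum_by_group` would still touch two blobs, while `read_row` would touch all 300.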

I was comparing it with Parquet, which is much more complex, but has features that help you access the data in less than O(n), like row groups and pages.
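As a rough illustration of how row groups avoid an O(n) scan: if the file metadata records how many rows each row group holds, finding the group that contains a given row number is a binary search over cumulative counts. The function names and group sizes below are hypothetical, and this sketches the idea only, not Parquet's actual footer structures.

```python
import bisect

def build_row_group_index(group_sizes):
    # starts[i] is the global row number of the first row in group i.
    starts, total = [], 0
    for size in group_sizes:
        starts.append(total)
        total += size
    return starts, total

def locate_row(starts, total, row_number):
    # Returns (row group index, offset of the row inside that group).
    if not 0 <= row_number < total:
        raise IndexError(row_number)
    group = bisect.bisect_right(starts, row_number) - 1
    return group, row_number - starts[group]

starts, total = build_row_group_index([1000, 1000, 500])
location = locate_row(starts, total, 1234)
```

Only the located row group then needs to be read and decompressed, instead of everything before the row.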
You mentioned NLJSON and CSV, which would require reading all columns from disk.
Yes, but you would usually have to read at least two columns anyway. What are the datasets that are too large to be ingested completely, but too small for a proper columnar format?

If ZSV is meant to occupy the gap between CSV/NLJSON (smaller datasets) and Parquet/DuckDB (larger datasets), this niche is actually really small, if not nonexistent.

Yes, it's unclear to me what the advantage is over Parquet with compression. And there are enough file formats flying around already.
Even with an OLAP use case, you're usually not scanning every row in the database if you have even a single where clause or conditional filter, which is almost always the case. You need some level of locality, and if your format doesn't support that, that alone will be enough to kill performance.

Also, Parquet has lots of features that'll get you to the general vicinity of a single record tolerably fast without sacrificing much in terms of storage or computational complexity. It's a small price for a big win.

There are two ways to limit the number of column-rows you have to read. One is file partitioning, that is, having many ZSV files rather than one giant one, ideally organized by partitioning key field(s). The other is described as an extension to the format itself, which functions much like row groups do in Parquet. https://github.com/Hafthor/zsvutil?tab=readme-ov-file#row-gr...
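A rough sketch of how the two techniques combine, using hypothetical in-memory structures in place of real files. The min/max skipping is modeled on Parquet-style row group statistics; whether the ZSV extension stores such statistics is an assumption here, not something taken from its README.

```python
from dataclasses import dataclass

@dataclass
class RowGroup:
    min_price: float
    max_price: float
    prices: list

def make_row_groups(prices, group_size):
    # Split one column into fixed-size groups and record min/max per group.
    groups = []
    for start in range(0, len(prices), group_size):
        chunk = prices[start:start + group_size]
        groups.append(RowGroup(min(chunk), max(chunk), chunk))
    return groups

def scan_partitions(partitions, wanted_key, low, high):
    """Return matching prices and the number of row groups actually read."""
    matches, groups_read = [], 0
    for key, groups in partitions.items():
        if key != wanted_key:
            continue  # partition pruning: skip the whole file
        for group in groups:
            if group.max_price < low or group.min_price > high:
                continue  # row group skipping via min/max statistics
            groups_read += 1
            matches.extend(p for p in group.prices if low <= p <= high)
    return matches, groups_read

# Keys stand in for one file per partitioning key value (hypothetical data).
partitions = {
    "2024-01-01": make_row_groups([1, 2, 3, 10, 11, 12, 20, 21, 22], 3),
    "2024-01-02": make_row_groups([5, 6, 7, 8], 2),
}
matches, groups_read = scan_partitions(partitions, "2024-01-01", 10, 12)
```

The skipping only pays off when the data has some locality in the filtered column, for example when it is sorted or clustered by that column.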

Thanks for taking a look.

Oh, sorry, I must've missed the part about rowgroups and metadata. Yes, this should work to limit the scans to a reasonable amount.
What is NLJSON?
Also known as JSONL, or JSON Lines. Basically a file of JSON objects separated by newlines. Popular format for logs these days for obvious reasons.
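For illustration, a minimal sketch of writing and reading the format with Python's standard `json` module; the helper names and sample records are made up.

```python
import json

def dumps_jsonl(records):
    # One JSON document per line; json.dumps escapes embedded newlines,
    # so each record stays on a single line.
    return "".join(json.dumps(record) + "\n" for record in records)

def loads_jsonl(text):
    # Parse each non-empty line independently.
    return [json.loads(line) for line in text.split("\n") if line.strip()]

records = [
    {"level": "info", "msg": "started"},
    {"level": "error", "msg": "line1\nline2"},
]
text = dumps_jsonl(records)
parsed = loads_jsonl(text)
```

Because every line parses on its own, a log file can be appended to, tailed, or split without reading the whole thing.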
NDJSON is the shorthand I've seen: https://github.com/ndjson/ndjson-spec
https://jsonlines.org/ was the first "this is trivial but let's write it down so maybe the name will stick" spec for it (from 2013ish)
Missed opportunity to just call it JSONS.
New Line delimited JSON