| HN Mirror



	by orthoxerox 795 days ago
	It is simple, but how do you access the price in row #1234567890? If your data doesn't have this many records and can fit into RAM, a basic NLJSON or CSV will work just as well.

3 comments

cm2187 795 days ago

Like Parquet, this isn't really meant for RDBMS-type databases; it's more for analytics over large datasets. I work in an environment where we typically have tables with over 300 columns and tens if not hundreds of millions of rows daily. When you want to do a simple sum/group by involving 2 or 3 columns, it is great to have a column-store file format where you only read the columns you need, and those are compressed.

The price you pay is that it is inefficient for single-record access, or for "select *" kinds of queries.
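
To make the trade-off concrete, here is a minimal sketch of a column store. It is not the ZSV or Parquet layout; the one-compressed-blob-per-column layout and the names `write_columns`, `read_column`, and `sum_by` are assumptions made up for illustration. The point is only that a sum/group by touches two blobs and never decompresses the third.

```python
import zlib

def write_columns(rows, names):
    """Store each column as its own compressed blob (hypothetical layout).

    Assumes values contain no newlines, since newline is the separator.
    """
    store = {}
    for i, name in enumerate(names):
        text = "\n".join(str(row[i]) for row in rows)
        store[name] = zlib.compress(text.encode("utf-8"))
    return store

def read_column(store, name):
    """Decompress and return one column; the other blobs stay untouched."""
    return zlib.decompress(store[name]).decode("utf-8").split("\n")

def sum_by(store, key_col, value_col):
    """Equivalent of SELECT key, SUM(value) GROUP BY key, reading two columns."""
    totals = {}
    keys = read_column(store, key_col)
    values = read_column(store, value_col)
    for key, value in zip(keys, values):
        totals[key] = totals.get(key, 0.0) + float(value)
    return totals

rows = [
    ("EUR", "desk-a", 10.0),
    ("USD", "desk-b", 5.5),
    ("EUR", "desk-c", 2.5),
]
store = write_columns(rows, ["currency", "desk", "amount"])
totals = sum_by(store, "currency", "amount")
print(totals)
```

Rebuilding a single full record in this layout means decompressing every column, which is the single-record cost described above.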


orthoxerox 795 days ago

I was comparing it with Parquet, which is much more complex, but has features that help you access the data in less than O(n), like row groups and pages.


cm2187 795 days ago

You mentioned NLJSON and CSV, which would require reading all columns from disk.


orthoxerox 794 days ago

Yes, but you would usually have to read at least two columns anyway. What are the datasets that are too large to be ingested completely, but too small for a proper columnar format?

If ZSV is meant to occupy the gap between CSV/NLJSON (smaller datasets) and Parquet/DuckDB (larger datasets), this niche is actually really small, if not nonexistent.


cm2187 794 days ago

Yes, it's unclear to me what the advantage is over Parquet with compression. And there are enough file formats flying around already.


makmanalp 795 days ago

Even with an OLAP use case, you're most often not scanning every row in the database if you have even a single WHERE clause or conditional filter, which is almost always the case. You need some level of locality, and if your format doesn't support that, that alone will be enough to kill performance.

Also, Parquet has lots of features that'll get you to the general vicinity of a single record tolerably fast without sacrificing much in terms of storage or computational complexity. It's a small price for a big win.


hafthor 795 days ago

There are two ways to limit the number of column-rows you have to read. One is file partitioning, that is, having many ZSV files rather than one giant one, ideally organized by partitioning key field(s). The other is mentioned as an extension to the format itself, which functions much like rowgroups do in Parquet. https://github.com/Hafthor/zsvutil?tab=readme-ov-file#row-gr...
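
As a rough illustration of the rowgroup idea in general (not taken from the linked README, and not the actual ZSV extension), the sketch below keeps a min/max per group and skips any group whose range cannot contain the value being searched for. The names `build_row_groups` and `find_rows` and the in-memory dict layout are hypothetical.

```python
def build_row_groups(values, group_size):
    """Split a column into fixed-size groups and record min/max per group."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append({
            "start": start,        # row index of the first value in the group
            "min": min(chunk),
            "max": max(chunk),
            "values": chunk,       # stands in for a compressed block on disk
        })
    return groups

def find_rows(groups, target):
    """Return matching row indices and the number of groups actually scanned."""
    hits = []
    scanned = 0
    for group in groups:
        if target < group["min"] or target > group["max"]:
            continue  # the metadata alone rules this group out
        scanned += 1
        for offset, value in enumerate(group["values"]):
            if value == target:
                hits.append(group["start"] + offset)
    return hits, scanned

# Sorted (or partition-clustered) data gives the tightest min/max ranges.
prices = list(range(1000))
groups = build_row_groups(prices, 100)
hits, scanned = find_rows(groups, 742)
print(hits, scanned, len(groups))
```

With unsorted data the ranges overlap and fewer groups can be skipped, which is why organizing files or groups by a partitioning key matters.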

Thanks for taking a look.


orthoxerox 794 days ago

Oh, sorry, I must've missed the part about rowgroups and metadata. Yes, this should work to limit the scans to a reasonable amount.


CapitalistCartr 795 days ago

What is NLJSON?


chuckadams 795 days ago

Also known as JSONL, or JSON Lines. Basically a file of JSON objects separated by newlines. Popular format for logs these days for obvious reasons.
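
For anyone who hasn't seen it, a minimal sketch of reading such a file with Python's standard `json` module; the helper name `parse_jsonl` and the sample log records are made up for illustration.

```python
import json

def parse_jsonl(text):
    """Parse JSON Lines: one JSON value per line, blank lines skipped."""
    records = []
    for line in text.split("\n"):
        if line.strip():
            records.append(json.loads(line))
    return records

sample = (
    '{"level": "info", "msg": "started"}\n'
    '{"level": "error", "msg": "disk full"}\n'
)
records = parse_jsonl(sample)
print(records[1]["msg"])
```

Each line parses on its own, so a log can be appended to or tailed without rewriting or re-reading the whole file.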


rzzzt 795 days ago