by jiggawatts 795 days ago
This copies the superficial data layout of modern columnar formats without their key benefit: segment elimination.

Most such formats support efficient querying by skipping the disk read step entirely when a chunk of data is not relevant to a query. This is done by splitting the data into segments of about 100K rows and calculating the min/max range of each column within each segment. Those ranges are stored separately in a header or small metadata file, which allows huge chunks of the data to be skipped entirely when they fall outside the range of a query predicate.
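
To make that concrete, here is a rough sketch of segment elimination over a single numeric column. It is only an illustration: `Segment`, `build_segments`, and `query_range` are names made up for this example, the segment size is tiny compared to the ~100K rows mentioned above, and everything lives in memory rather than on disk.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One chunk of a column plus its min/max metadata (hypothetical layout)."""
    min_value: float
    max_value: float
    rows: list


def build_segments(values, segment_size):
    """Split a column into fixed-size segments and record each segment's range."""
    segments = []
    for start in range(0, len(values), segment_size):
        chunk = values[start:start + segment_size]
        segments.append(Segment(min(chunk), max(chunk), chunk))
    return segments


def query_range(segments, low, high):
    """Return values in [low, high], touching only segments whose range overlaps."""
    matches = []
    segments_read = 0
    for seg in segments:
        # Segment elimination: the metadata alone proves no row can match,
        # so the segment's rows are never read.
        if seg.max_value < low or seg.min_value > high:
            continue
        segments_read += 1
        matches.extend(v for v in seg.rows if low <= v <= high)
    return matches, segments_read


prices = [float(i) for i in range(1000)]
segments = build_segments(prices, segment_size=100)
matches, segments_read = query_range(segments, 250.0, 260.0)
print(f"read {segments_read} of {len(segments)} segments, {len(matches)} matching rows")
```

In a real file format the skipped segments would correspond to byte ranges that are never fetched from disk. Note that the benefit depends on the data being at least roughly sorted or clustered on the filtered column; otherwise most segments span the whole value range and few can be skipped.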

PS: the same compression ratio advantages could be achieved by compressing columns stored as JSON arrays, and such a format could also encode all Unicode characters and would have a readily available decoder in all mainstream programming languages.
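
A minimal sketch of what I mean, using only the Python standard library. The sample rows, the choice of zlib, and the helper names (`rows_to_columns`, `encode_column`, `decode_column`) are assumptions made for illustration, not part of any spec.

```python
import json
import zlib

# Sample rows with non-ASCII text to show that JSON handles Unicode as-is.
rows = [
    {"name": "Zoë", "city": "København", "price": 111.11},
    {"name": "李雷", "city": "北京", "price": 222.22},
    {"name": "Olga", "city": "Київ", "price": 333.33},
]


def rows_to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def encode_column(values):
    """Serialize one column as a JSON array and compress it."""
    text = json.dumps(values, ensure_ascii=False)
    return zlib.compress(text.encode("utf-8"))


def decode_column(blob):
    """Reverse encode_column: decompress, then parse the JSON array."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


columns = rows_to_columns(rows)
encoded = {name: encode_column(values) for name, values in columns.items()}
decoded = {name: decode_column(blob) for name, blob in encoded.items()}
print(decoded["name"], decoded["price"])
```

Each column is compressed on its own, so similar values sit next to each other in the compressor's input, and a reader can decompress only the columns a query needs.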

2 comments

It's there, specified as an optional feature.

> Price⇥⇥0 {rows:2, distinct:2, minvalue:111.11, maxvalue:222.22} 111.11⮐222.22⮐

> Price⇥⇥1 {rows:1, distinct:1, minvalue:333.33, maxvalue:333.33} 333.33⮐

I like your idea of storing columns as JSON arrays. I might play around with that. Thanks for giving it a look.

I have a sinking feeling like I’ve unleashed something here.

Some future programmer will be cursing my name as they try to make columnar JSON decoding performant.

Hehe. I added an alternative JSON inner format spec to the readme. I need to add JSON and CSV support to zsvutil itself next. I may actually change the spec to default to JSON. All because of you. Haha.