by sl-dolt 1204 days ago
Absolutely interested, on my end at least. I wrote this to manage the transparency in coverage files: https://github.com/dolthub/data-analysis/tree/main/transpare... but I'm always looking for better techniques.

Edit: Oh wow, I see you used it on those exact files. How about that.

Ha! Thanks to you, today I found out how big those uncompressed JSON files really are (the data wasn't accessible to me, so I shared the tool with my colleague, and he was the one who ran the queries on his laptop): https://www.dolthub.com/blog/2022-09-02-a-trillion-prices/

And yep, it was more or less the way you did it with ijson. I found ijson just a day after I finished the prototype. RapidJSON would probably be faster, especially with SIMD enabled, but the indexing was a one-time thing.

We have open-sourced the codebase. Here's the link: https://github.com/multiversal-ventures/json-buffet . Since this was a quick-and-dirty prototype, comments were sparse. I have updated the README and added a sample json-fetcher. Hope this makes it more useful for you.
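
For anyone skimming, here is a rough sketch of the "index once, fetch later" idea. To be clear, this is not the json-buffet code and it does not use ijson or RapidJSON; it is a stdlib-only byte scanner, and the names `build_index` and `fetch` are made up for illustration. It assumes the document is one top-level array whose elements are objects or arrays.

```python
import io
import json

QUOTE, BACKSLASH = ord('"'), ord("\\")
OPENERS, CLOSERS = b"{[", b"]}"


def build_index(stream, chunk_size=65536):
    """Scan a binary stream once and return (start, end) byte offsets
    for each container element of the top-level JSON array."""
    index = []
    depth = 0            # nesting depth outside of strings
    in_string = False    # currently inside a JSON string literal
    escaped = False      # previous byte was a backslash inside a string
    start = 0
    pos = 0              # absolute byte position in the stream
    while chunk := stream.read(chunk_size):
        for byte in chunk:
            if in_string:
                if escaped:
                    escaped = False
                elif byte == BACKSLASH:
                    escaped = True
                elif byte == QUOTE:
                    in_string = False
            elif byte == QUOTE:
                in_string = True
            elif byte in OPENERS:
                if depth == 1:          # an element of the top-level array begins
                    start = pos
                depth += 1
            elif byte in CLOSERS:
                depth -= 1
                if depth == 1:          # that element just ended
                    index.append((start, pos + 1))
            pos += 1
    return index


def fetch(stream, index, i):
    """Seek straight to element i and parse only that slice."""
    start, end = index[i]
    stream.seek(start)
    return json.loads(stream.read(end - start))


sample = b'[{"code": "A1", "note": "has ] and \\" inside"}, {"code": "B2", "prices": [1, 2]}]'
sample_index = build_index(io.BytesIO(sample))
print(fetch(io.BytesIO(sample), sample_index, 1))
```

The scan is slow in pure Python, but it only has to happen once; after that every lookup is a seek plus a small parse.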

Another unwritten TODO was to nudge the data providers towards more streaming-friendly compression formats, and then just create an index to fetch the data directly from their compressed archives. That would have saved everyone a LOT of $$$.
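
One way such a format could look (purely an assumption on my part, nobody in this thread proposed a specific one): compress each record as its own gzip member and keep a side index of byte offsets. The concatenated file is still a valid gzip stream, but any single record can be pulled out with a ranged read. The function names below are hypothetical.

```python
import gzip
import io
import json


def write_indexed_archive(records):
    """Compress each record as an independent gzip member.
    Returns the archive bytes and a list of (offset, length) entries."""
    archive = io.BytesIO()
    index = []
    for record in records:
        line = (json.dumps(record) + "\n").encode("utf-8")
        member = gzip.compress(line)
        index.append((archive.tell(), len(member)))
        archive.write(member)
    return archive.getvalue(), index


def fetch_record(archive, index, i):
    """Decompress only the member holding record i.
    With a remote file, the slice would be an HTTP range request instead."""
    offset, length = index[i]
    return json.loads(gzip.decompress(archive[offset:offset + length]))


records = [{"code": f"P{n}", "price": n * 10} for n in range(5)]
archive, index = write_indexed_archive(records)
print(fetch_record(archive, index, 3))
```

The trade-off is a worse compression ratio than one big stream, since each member starts with an empty dictionary; grouping many records per member would claw some of that back.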

I used RapidJSON's stream interface with my little embedded REST HTTP(S) server library: https://github.com/Edgio/is2/

We needed it for streaming large JSON from async server sockets.

Code link: https://github.com/Edgio/is2/blob/master/include/is2/support...

You just had to implement the interfaces like Peek/Take/Tell/etc. It worked really well for us.

Probably not as fast as simdjson, but I think they used some SIMD tricks for skipping whitespace:

https://rapidjson.org/md_doc_internals.html#SkipwhitespaceWi...