by wongarsu 2673 days ago
I have in the past parsed terabytes of JSON. The specific use case was analysing archived Reddit comments. The Reddit API uses JSON, and somebody [1] runs a server that just dumps the comments in a file, one line of JSON per comment, and offers the files for download (compressed, obviously). So now you end up with gigabytes of small JSON documents per month, and anything you do will quickly be dominated by JSON parsing time.
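To make the cost concrete, here is a minimal sketch of how such a dump is typically consumed: stream the file line by line and parse each line on its own. The bz2 compression, the `subreddit` field, and both function names are assumptions for illustration, not details taken from the actual dumps.

```python
import bz2
import json
from collections import Counter


def count_by_subreddit(lines):
    """Count comments per subreddit from an iterable of JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        comment = json.loads(line)  # one full parse per comment: the hot spot
        # Use .get() because the set of fields changed over the years
        # ("subreddit" is an assumed field name).
        counts[comment.get("subreddit", "<unknown>")] += 1
    return counts


def count_file(path):
    """Stream a bz2-compressed dump (assumed format) without unpacking it to disk."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        return count_by_subreddit(f)
```

Even for a trivial aggregate like this, every comment goes through a full `json.loads`, which is why parsing ends up dominating the run time.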

You could store them in some binary format, but the API response format changed over the years with various fields being added and removed, and either your binary format ends up not much better than JSON or you end up reencoding old comments because the API changed.

1: http://files.pushshift.io/reddit/

2 comments

The parsed format in tape.md is quite close to the FlatBuffers format. FlatBuffers can encode any JSON file just fine. There is no separate parse step: the data is read in place, so access needs no extra memory.

It’s a great way to store big JSON files when you only want to access a subset of the data very quickly, without loading the whole file into memory.

https://google.github.io/flatbuffers/

> either your binary format ends up not much better than JSON or you end up reencoding old comments because the API changed

There are other options too, e.g. storing the schema separately from the records (then batching records with identical schemas in compact binary files) and defining migration rules between different schemas. For example, if schema A has required field "foo" while schema B has required field "foo" and optional field "bar", then data which follows schema A can be trivially migrated to schema B at read time, without needing to reencode anything on disk.

https://avro.apache.org/docs/current/
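
The read-time migration described above can be sketched in a few lines. This is a hand-rolled illustration of the idea, not Avro's API; the schema representation (field name mapped to a required flag and a default) and the function names `can_migrate` and `read_as` are hypothetical.

```python
# Hypothetical schemas: field name -> (required?, default used when absent).
SCHEMA_A = {"foo": (True, None)}
SCHEMA_B = {"foo": (True, None), "bar": (False, None)}


def can_migrate(writer, reader):
    """True if every record valid under `writer` can be read as `reader`."""
    for name, (required, _default) in reader.items():
        if required and not (name in writer and writer[name][0]):
            return False  # reader needs a field the writer may not supply
    return True


def read_as(record, writer, reader):
    """Project a record written with `writer` onto `reader` at read time."""
    if not can_migrate(writer, reader):
        raise ValueError("no migration rule between these schemas")
    out = {}
    for name, (_required, default) in reader.items():
        # Missing optional fields get the reader's default; fields the
        # reader does not know about are dropped.
        out[name] = record[name] if name in record else default
    return out
```

With these definitions, `read_as({"foo": 1}, SCHEMA_A, SCHEMA_B)` yields a record with `foo` set to 1 and `bar` set to its default, while the bytes on disk stay in schema A.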