by wongarsu
2673 days ago
I have parsed terabytes of JSON in the past. The specific use case was analysing archived Reddit comments. The Reddit API uses JSON, and somebody [1] runs a server that just dumps the comments into a file, one line of JSON per comment, and offers the files for download (compressed, obviously). So you end up with gigabytes of small JSON objects per month, and anything you do is quickly dominated by JSON parsing time. You could store them in some binary format, but the API response format changed over the years, with various fields being added and removed, so either your binary format ends up not much better than JSON or you end up re-encoding old comments because the API changed.

[1] http://files.pushshift.io/reddit/
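To make the workload concrete, here is a minimal sketch of streaming such a dump in Python. It is not the code I actually used. It assumes a bz2-compressed file with one JSON object per line and a `subreddit` field on each comment; the function names and the `<missing>` placeholder are made up for illustration, and the real dumps may use other compression formats.

```python
import bz2
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_comments(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed comment per non-empty line, skipping malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)  # this call is where most of the time goes
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            yield obj


def count_by_subreddit(lines: Iterable[str]) -> Counter:
    """Count comments per subreddit without holding the dump in memory."""
    counts = Counter()
    for comment in iter_comments(lines):
        # .get() tolerates fields that were added or removed over the years.
        counts[comment.get("subreddit", "<missing>")] += 1
    return counts


def count_file(path: str) -> Counter:
    """Stream a bz2-compressed dump (assumed format) line by line."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        return count_by_subreddit(f)
```

Even in a loop this simple, every line goes through `json.loads`, so the parser's throughput effectively sets the speed of the whole job.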
|
FlatBuffers is a great way to store big JSON files when you only want to access a subset of the data very quickly without loading the whole file into memory.
https://google.github.io/flatbuffers/