Hacker News
by adrianN 2674 days ago
I feel like if you need to parse Gigabytes per second of JSON, you should probably think about using a more efficient serialization format than JSON. Binary formats are not much harder to generate and can save a lot of bandwidth and CPU time.
5 comments

I have in the past parsed terabytes of JSON. The specific use case was analysing archived Reddit comments. The Reddit API uses JSON, and somebody [1] runs a server that just dumps them in a file, one line of JSON per comment, and offers them for download (compressed, obviously). So now you end up with Gigabytes of small JSONs per month, and anything you do will be quickly dominated by JSON parsing time.

You could store them in some binary format, but the API response format changed over the years with various fields being added and removed, and either your binary format ends up not much better than JSON or you end up reencoding old comments because the API changed.

1: http://files.pushshift.io/reddit/
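For illustration, a minimal sketch of that kind of workload, assuming one JSON object per line as described above. The field names (`author`, `subreddit`, `body`, `gilded`) and the sample records are made up for the example and not taken from the actual dumps; the point is only that every line has to go through a full JSON parse, and that fields may be missing in older records.

```python
import io
import json
from collections import Counter


def iter_comments(lines):
    """Yield one dict per non-empty line of newline-delimited JSON."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by_subreddit(lines):
    """Count comments per subreddit, tolerating records that lack the field."""
    counts = Counter()
    for comment in iter_comments(lines):
        counts[comment.get("subreddit", "<missing>")] += 1
    return counts


# Hypothetical sample; a real dump would be a (decompressed) file object.
sample = io.StringIO(
    '{"author": "a", "subreddit": "python", "body": "hi"}\n'
    '{"author": "b", "body": "older record without the field"}\n'
    '{"author": "c", "subreddit": "python", "body": "hello", "gilded": 0}\n'
)
counts = count_by_subreddit(sample)
```

Even in a loop this small, nearly all the work is inside `json.loads`, which is why parse speed dominates at this scale.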

The parsed format in tape.md is quite close to the FlatBuffers format. FlatBuffers can encode any JSON file just fine. Access is immediate: there is no separate parse step and no extra memory is needed.

It’s a great way to store big JSON files where you only want to access a subset of the data very quickly without loading the whole file into memory.

https://google.github.io/flatbuffers/
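To show the general idea of reading one value without decoding everything else, here is a rough sketch using a hand-rolled fixed-width layout and the standard `struct` module. This is not the FlatBuffers wire format or API; the record layout (`u32 id`, `i64 score`) is purely an assumption for the example.

```python
import struct

# Hypothetical layout: little-endian u32 id followed by i64 score, no padding.
RECORD = struct.Struct("<Iq")


def pack_rows(rows):
    """Encode (id, score) pairs back to back into a single bytes object."""
    return b"".join(RECORD.pack(row_id, score) for row_id, score in rows)


def score_at(buf, index):
    """Read the score of record `index` directly from its byte offset.

    No other record is touched, so `buf` could just as well be an mmap
    over a large file instead of an in-memory bytes object.
    """
    _, score = RECORD.unpack_from(buf, index * RECORD.size)
    return score


buf = pack_rows([(1, 10), (2, -5), (3, 42)])
third_score = score_at(buf, 2)
```

Because every record has the same size, the offset of record `index` is plain arithmetic. Formats like FlatBuffers get a similar effect for variable-size data by storing offsets inside the buffer.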

> either your binary format ends up not much better than JSON or you end up reencoding old comments because the API changed

There are other options too, e.g., storing the schema separately from the records (then batching records with identical schemas in compact binary files) and defining migration rules between different schemas. For example, if schema A has a required field "foo" while schema B has a required field "foo" and an optional field "bar", then data that follows schema A can be trivially migrated to schema B at read time without needing to reencode it on disk.

https://avro.apache.org/docs/current/
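A small sketch of the read-time migration idea, using the A/B example above. This is plain Python and not the Avro API; the schema representation (field name mapped to a `(required, default)` pair) and the helper names are invented for the illustration.

```python
# Hypothetical schemas: field name -> (required, default used when absent).
SCHEMA_A = {"foo": (True, None)}
SCHEMA_B = {"foo": (True, None), "bar": (False, None)}


def can_migrate(writer, reader):
    """Records written under `writer` are readable under `reader` if every
    field the reader requires is also required by the writer."""
    return all(
        name in writer and writer[name][0]
        for name, (required, _) in reader.items()
        if required
    )


def read_as(record, writer, reader):
    """Project a stored record onto the reader schema at read time."""
    if not can_migrate(writer, reader):
        raise ValueError("schemas are not compatible")
    out = {}
    for name, (_, default) in reader.items():
        # Missing fields can only be optional ones, so fill in the default.
        out[name] = record.get(name, default)
    return out


old_batch = [{"foo": 1}, {"foo": 2}]  # written once under schema A
migrated = [read_as(r, SCHEMA_A, SCHEMA_B) for r in old_batch]
```

The stored bytes for `old_batch` never change; only the view handed to the reader does, which is what avoids reencoding old data when the API adds a field.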

Maybe they want to convert incoming JSON to a binary serialization format to save bandwidth, storage and CPU time on the rest of the pipeline ;)
That’s a nice sentiment but we don’t always get to choose.
I agree. But binary serialization is very complicated for very little gain. It would make it impossible to do things like opening the JSON file in an editor to change some property names. So watch out for premature optimization.
What if you're ingesting thousands or millions of small feeds? You might not have much control over the format your clients send, or much desire to dictate it to them.
Yeah, not everyone using a parsing library is in control of the input data format; I’d even say the majority of people are not.
For storing stuff yourself, sure, but as a web developer, most data I consume is JSON served by some third-party REST API and the format they serve me is definitely not under my control. Anecdotally, most developers I know or have spoken to are in similar situations for a large portion of their data-processing needs (at least, for stuff that's not in a database, although even in DBs, JSON is increasingly popular for a number of reasons).

Even for output, there is the common case where your clients expect JSON because it's the de facto standard and is super accessible (every language has parsers for it), so you have little choice but to serve your data as JSON.

The readme specifies that it’s not optimized for reading a large number of small files.
This would be an easy extension if you wanted to concatenate the files. The plumbing and API aren't there right now, but it isn't hard to see how to do it.
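As a rough illustration of the concatenation idea, here is how it looks with Python's standard `json` module rather than the library under discussion, whose API for this does not exist yet according to the comment above. The inputs are made up; the sketch joins several small documents into one buffer and then walks through it in a single pass.

```python
import json


def iter_concatenated(buffer):
    """Yield each JSON document found in a string of back-to-back documents."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(buffer)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < end and buffer[pos].isspace():
            pos += 1
        if pos == end:
            break
        doc, pos = decoder.raw_decode(buffer, pos)
        yield doc


# Hypothetical contents of three small files.
small_files = ['{"id": 1}', '{"id": 2, "tags": ["a"]}', "[3]"]
docs = list(iter_concatenated("\n".join(small_files)))
```

The per-document setup cost is paid once for the whole buffer instead of once per file, which is the part a batch-oriented parser would benefit from.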