Hacker News
by barishnamazov 99 days ago
You shouldn't be using JSON for things that'd have performance implications.
3 comments

As with most things in engineering, it depends. There are real logistical costs to using binary formats. This format is almost as compact as a binary format while still retaining all the nice qualities of an ASCII-friendly encoding (you can embed it anywhere strings are allowed, including copy-paste workflows).

Think of it as a hybrid between JSON, SQLite, and generic compression. This format really excels for use cases where large read-only build artifacts are queried by worker nodes like an embedded database.

The cost of using a textual format is that floats become slow to parse: over 14 times slower than parsing a normal integer, even with the fastest SIMD algorithms we have right now.
So it depends. Float parsing performance is only a problem if you parse many floats, and lazy access might reduce work significantly (or add overhead: it depends).
Exactly. For my use cases, this format is amazing. I have very few floats, but lots and lots of objects, arrays and strings with moderate levels of duplication and substring duplication. My data is produced in a build and then read in thousands or millions of tiny queries that each look up a single value deep inside the structure.

rx works very well as a kind of embedded database like sqlite, but completely unstructured like JSON.

Also I'm working on an extension that makes it mutable using append-only persistent data structures with a fixed-block caching level that is actually a pretty good database.

If your data is lots and lots of arrays of floats, this is likely not the format for you. Use float arrays.

Also note it stores decimals in a very compact encoding (two varints for base and power of 10).
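To make the "base and power of 10" idea concrete, here is a minimal Python sketch of the decomposition only. It is not rx's actual wire format: the function names are made up for illustration, and the varint step is left out because the thread doesn't describe how rx lays out those bytes.

```python
from decimal import Decimal


def to_base_and_power(text: str) -> tuple[int, int]:
    """Split a decimal literal into (base, power) so that value == base * 10**power.

    Hypothetical helper for illustration; rx's real encoder may differ.
    """
    sign, digits, exponent = Decimal(text).as_tuple()
    if not isinstance(exponent, int):
        raise ValueError("NaN and Infinity have no base/power form")
    base = int("".join(map(str, digits)))
    # Move trailing zeros out of the base and into the power of 10,
    # so that 1500 becomes (15, 2) instead of (1500, 0).
    while base != 0 and base % 10 == 0:
        base //= 10
        exponent += 1
    if base == 0:
        exponent = 0
    return (-base if sign else base, exponent)


def from_base_and_power(base: int, power: int) -> Decimal:
    """Rebuild the exact decimal value from the two integers."""
    return Decimal(base).scaleb(power)
```

Both halves are plain integers, so no float parsing is involved when reading them back: `to_base_and_power("3.14")` gives `(314, -2)`, and each integer could then be written with whatever varint scheme the format uses.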

That said, while this is a text format, it is also technically binary safe and could be extended with a new type tag to contain binary data if desired.

And with little data (i.e. <10 MB), this matters much less than accessibility and easy understanding of the data using a simple text editor or jq in the terminal + some filters.
What do you mean by little data? Most communication protocols are not one-off.
Also good luck parsing 10 MiB of JSON in a loop that can't tolerate blocking the CPU for more than 10ms.

What's expensive is very relative to the use case.

I agree in principle. However JSON tooling has also got so good that other formats, when not optimized and held correctly, can be worse than JSON. For example IME stock protocol buffers can be worse than a well optimized JSON library (as much as it pains me to say this).
Yeah the raw parse speed comparison is almost a red herring at this point. The real cost with JSON is when you have a 200MB manifest or build artifact and you need exactly two fields out of it. You're still loading the whole thing into memory, building the full object graph, and GC gets to clean all of it up after. That's the part where something like RX with selective access actually matters. Parse speed benchmarks don't capture that at all.
> The real cost with JSON is when you have a 200MB manifest or build artifact and you need exactly two fields out of it.

There are SAX-like JSON libraries out there, and several of them work with a preallocated buffer or similar streaming interface, so you could stream the file and pick out the two fields as they come along.

You still have to parse half of the file on average. That is much slower than formats that support skipping directly to the relevant information.
yep, this is exactly the kind of use case that caused me to design this format.
As a parser: keep only indexes into the original file (input); don't copy strings or parse numbers at all (unless the strings fit in the index width, e.g. 32 bits).

That would make parsing faster, and there would be very little in terms of a tree (JSON can't really contain full-blown graphs), but it's rather complicated, and it would require hashing to allow navigation.
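A rough Python sketch of the index-only idea, assuming well-formed JSON with an object at the top level. The helper names are invented here, and a plain `dict` stands in for the hashing step. It records byte offsets for each top-level value and decodes nothing until asked.

```python
import json

QUOTE, BACKSLASH = 0x22, 0x5C
WHITESPACE = b" \t\r\n"


def skip_ws(buf: bytes, i: int) -> int:
    while i < len(buf) and buf[i] in WHITESPACE:
        i += 1
    return i


def skip_string(buf: bytes, i: int) -> int:
    """buf[i] is the opening quote; return the offset just past the closing quote."""
    i += 1
    while buf[i] != QUOTE:
        i += 2 if buf[i] == BACKSLASH else 1  # a backslash escapes the next byte
    return i + 1


def skip_value(buf: bytes, i: int) -> int:
    """Return the offset just past the JSON value starting at buf[i], without decoding it."""
    if buf[i] == QUOTE:
        return skip_string(buf, i)
    if buf[i] in b"{[":
        depth = 0
        while True:
            c = buf[i]
            if c == QUOTE:
                i = skip_string(buf, i)  # brackets inside strings don't count
                continue
            if c in b"{[":
                depth += 1
            elif c in b"}]":
                depth -= 1
                if depth == 0:
                    return i + 1
            i += 1
    # Number, true, false or null: runs until a delimiter.
    while i < len(buf) and buf[i] not in b",}] \t\r\n":
        i += 1
    return i


def index_object(buf: bytes) -> dict[bytes, tuple[int, int]]:
    """Map each top-level key (raw bytes, still escaped) to the (start, end) span of its value."""
    spans: dict[bytes, tuple[int, int]] = {}
    i = skip_ws(buf, 0)
    if buf[i] != ord("{"):
        raise ValueError("expected a JSON object")
    i = skip_ws(buf, i + 1)
    while buf[i] != ord("}"):
        key_end = skip_string(buf, i)
        key = buf[i + 1 : key_end - 1]
        i = skip_ws(buf, key_end)
        if buf[i] != ord(":"):
            raise ValueError("expected ':' after key")
        i = skip_ws(buf, i + 1)
        end = skip_value(buf, i)
        spans[key] = (i, end)
        i = skip_ws(buf, end)
        if buf[i] == ord(","):
            i = skip_ws(buf, i + 1)
    return spans


doc = b'{"name": "rx", "tags": ["a", "b}"], "nested": {"x": [1, 2]}, "n": 42}'
spans = index_object(doc)
start, end = spans[b"tags"]
tags = json.loads(doc[start:end])  # decode only the one value that is needed
```

This avoids building the object graph, but `index_object` still has to walk every byte of the document once to find where the values end.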

yep. I built custom JSON parsers as a first solution. The problem is you can't get away from scanning at least half the document bytes on average.

With RX and other truly random-access formats you could even optimize to the point of not even fetching the whole document. You could grab chunks from a remote server using HTTP range requests and cache locally in fixed-width blocks.

With JSON you must start at the front and read byte-by-byte till you find all the data you're looking for. Smart parsers can help a lot to reduce heap allocations, but you can't skip the state machine scan.
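The range-request idea mentioned above could look roughly like this in Python. Everything here is an illustrative assumption rather than anything RX ships: `BlockCache`, `http_fetcher`, and the 4096-byte block size are made up, and the HTTP fetcher only works against a server that honors `Range` headers.

```python
import urllib.request
from typing import Callable


class BlockCache:
    """Read arbitrary byte ranges through a cache of fixed-width blocks."""

    def __init__(self, fetch_range: Callable[[int, int], bytes], block_size: int = 4096):
        self.fetch_range = fetch_range  # (start, end_exclusive) -> bytes
        self.block_size = block_size
        self.blocks: dict[int, bytes] = {}
        self.fetches = 0  # number of blocks actually fetched

    def _block(self, n: int) -> bytes:
        if n not in self.blocks:
            start = n * self.block_size
            self.blocks[n] = self.fetch_range(start, start + self.block_size)
            self.fetches += 1
        return self.blocks[n]

    def read(self, offset: int, length: int) -> bytes:
        if length <= 0:
            return b""
        first = offset // self.block_size
        last = (offset + length - 1) // self.block_size
        data = b"".join(self._block(n) for n in range(first, last + 1))
        skip = offset - first * self.block_size
        return data[skip : skip + length]


def http_fetcher(url: str) -> Callable[[int, int], bytes]:
    """Build a fetch function that issues HTTP range requests against url.

    Assumes the server answers with 206 Partial Content; a server that
    ignores the Range header and returns the whole body would break the cache.
    """
    def fetch(start: int, end: int) -> bytes:
        # HTTP byte ranges are inclusive on both ends, hence end - 1.
        req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end - 1}"})
        with urllib.request.urlopen(req) as resp:
            return resp.read()
    return fetch


# Local stand-in for a remote file, so the sketch runs without a network.
data = bytes(range(256)) * 64
cache = BlockCache(lambda start, end: data[start:end], block_size=4096)
chunk = cache.read(4090, 10)  # straddles two blocks, so two fetches
again = cache.read(4096, 100)  # served entirely from the cached second block
```

A reader for a random-access format would call `cache.read(offset, length)` for each pointer it follows, so only the blocks on the lookup path are ever transferred.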

Can you imagine if a service as chatty and performance sensitive as Discord used JSON for their entire API surface?