Hacker News
by ArrayBoundCheck 1372 days ago
Parsing JSON takes more time than IO
2 comments

All the object allocation, pointer chasing, and general indirection is what makes JSON parsing slow, particularly building up tree structures. Bump allocators and the like don't help much, at least for the parsing--most malloc implementations already do fast, O(1) allocation for small objects, and have for decades. (Arenas are useful for making deallocation O(1), though.)

If you care about raw JSON throughput, use Ragel or something similar to build a state machine that directly parses JSON into a flat, native data structure. Now you have zero allocations without even needing to shim malloc/free. AVX-512 would still be at least as useful, but it's a much more difficult problem to leverage SIMD in a parser generator than in a simple string escape routine or behind a more abstract interface like a regex library.
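As a rough illustration of the "state machine into a flat structure" idea, here is a minimal hand-written sketch in Python. It is not Ragel output and not a full JSON parser: it assumes the input is a single JSON array of integers, it accepts leading zeros that strict JSON rejects, and the function name `parse_int_array` and the state names are made up for the example. Python still allocates behind the scenes, so this only shows the shape of the approach: one pass over the bytes, no tree of nodes, values written straight into a preallocated buffer.

```python
from array import array

WS = b" \t\n\r"                      # JSON whitespace bytes
ZERO, NINE, MINUS = ord("0"), ord("9"), ord("-")
LBRACKET, RBRACKET, COMMA = ord("["), ord("]"), ord(",")

# States of the machine (hypothetical names for this sketch).
START, FIRST, VALUE, SIGN, NUMBER, AFTER, DONE = range(7)


def parse_int_array(data: bytes, out: array) -> int:
    """Parse a JSON array of integers from `data` into the flat buffer `out`.

    Returns how many values were written. Raises ValueError on malformed
    input and IndexError if `out` is too small.
    """
    state, count, value, sign = START, 0, 0, 1
    for b in data:
        if state == NUMBER:
            if ZERO <= b <= NINE:
                value = value * 10 + (b - ZERO)
                continue
            # The number just ended: store it, then treat `b` as a separator.
            out[count] = sign * value
            count += 1
            sign = 1
            state = AFTER
        if b in WS and state != SIGN:
            continue
        if state == START and b == LBRACKET:
            state = FIRST
        elif state in (FIRST, VALUE, SIGN) and ZERO <= b <= NINE:
            value = b - ZERO
            state = NUMBER
        elif state in (FIRST, VALUE) and b == MINUS:
            sign = -1
            state = SIGN
        elif state in (FIRST, AFTER) and b == RBRACKET:
            state = DONE
        elif state == AFTER and b == COMMA:
            state = VALUE
        else:
            raise ValueError(f"unexpected byte {b!r} in state {state}")
    if state != DONE:
        raise ValueError("unexpected end of input")
    return count


buf = array("q", [0] * 8)            # preallocated flat buffer of int64 slots
n = parse_int_array(b"[1, -2, 30]", buf)
print(list(buf[:n]))
```

A generated machine would cover the whole grammar and a known target struct rather than a single array, but the control flow is the same kind of byte-by-byte transition loop.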

Quite a few language environments these days provide in-language JSON deserializers, but they're still significantly slower than they could be even when they deserialize to flat data structures. The macro languages and internal compiler intrinsics used to accomplish this are the worst possible environments for development. Lisp-like languages aren't really an exception as they tend to trade easier in-language transforms for a steeper climb when it comes to generating optimized native code for the transform.

I didn't care about speed. I was merely saying that the "waiting on I/O" claim is false, and I didn't want to go as far as saying it hasn't been true for 15+ years.
Is there any serialization format that is more "friendly" to memory allocators while still being human-readable and -writable?
It's not about the format per se, but more about the fact that you're parsing an unknown/fully general structure in that format. APIs like Rust's serde can help to avoid excessive allocation when you have JSON in a known schema.
If a file is so large that the processor spends a lot of time parsing, then the file is too large to be conveniently edited by a person.

For large files it is best to use a binary format that can be read quickly without parsing or allocation. https://rkyv.org/ is an example.

Being 'friendly' is not why JSON is popular. JSON is popular because the decoder is included in the web browser.

Thanks, and good point about dataset size.

I appreciate the thorough "shootout" benchmarks provided by the authors as well!

Text isn't readable without software. Why should we expect binary data formats to be?
simdjson parses JSON directly into its own flat tape, which you can use as-is if you care to.
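For anyone unfamiliar with the term, a "tape" is just a flat array of entries instead of a tree of nodes. The Python sketch below shows a simplified layout, not simdjson's actual tape format: the tags, the `build_tape` and `value_index` names, and the entry shape are all assumptions made for the example. To stay short it flattens an already-decoded value, which a real parser would not do; the point is only what the flat result looks like and how you can walk it without rebuilding a tree.

```python
import json


def build_tape(value, tape=None):
    """Flatten a decoded JSON value into a list of [tag, payload] entries.

    Container-opening entries store the index just past their closing
    entry, so a reader can skip a whole subtree in one step.
    """
    if tape is None:
        tape = []
    if isinstance(value, dict):
        start = len(tape)
        tape.append(["{", None])
        for key, item in value.items():
            tape.append(["key", key])
            build_tape(item, tape)
        tape.append(["}", start])
        tape[start][1] = len(tape)
    elif isinstance(value, list):
        start = len(tape)
        tape.append(["[", None])
        for item in value:
            build_tape(item, tape)
        tape.append(["]", start])
        tape[start][1] = len(tape)
    else:
        tape.append(["val", value])
    return tape


def value_index(tape, key):
    """Return the tape index of the value for `key` in the root object, or None."""
    if tape[0][0] != "{":
        raise TypeError("root of the tape is not an object")
    i, end = 1, tape[0][1] - 1       # `end` is the index of the closing '}'
    while i < end:
        is_match = tape[i][1] == key  # entry at i is always a key here
        i += 1                        # move to that key's value
        if is_match:
            return i
        tag, payload = tape[i]
        i = payload if tag in ("{", "[") else i + 1   # skip the value
    return None


tape = build_tape(json.loads('{"a": [1, 2, {"b": 3}], "c": "x"}'))
print(tape[value_index(tape, "c")])
```

Looking up `"c"` jumps over the whole nested array under `"a"` in a single step, which is the kind of direct use the flat layout allows.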
This claims 3 GB/s, which would saturate a fully queued NVMe read. It’s not clear whether they construct data or just parse it, but you have to construct data with any format. Other libs also do a great job. Also, I/O latency should be accounted for, because an average JSON document usually fits into one SSD block, afaiu.

https://lemire.me/blog/2020/03/31/we-released-simdjson-0-3-t...

(Edit: just realized this is the same site as subj, heh)
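If you want a number for your own machine to set against a drive's sequential read rate, a quick measurement is easy to write. The sketch below times Python's standard `json.loads`, not simdjson, so it says nothing about the 3 GB/s figure itself; the record shape and the helper names `make_payload` and `parse_throughput_mb_per_s` are invented for the example, and the result will vary a lot with hardware and document shape.

```python
import json
import time


def make_payload(n_records: int) -> bytes:
    """Build a synthetic JSON document (made-up record shape)."""
    records = [
        {"id": i, "name": f"user{i}", "tags": ["a", "b"], "score": i * 0.5}
        for i in range(n_records)
    ]
    return json.dumps(records).encode("utf-8")


def parse_throughput_mb_per_s(payload: bytes, repeats: int = 5) -> float:
    """Return the best observed json.loads throughput in MB/s (1 MB = 1e6 bytes)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        json.loads(payload)          # builds the full tree of Python objects
        best = min(best, time.perf_counter() - start)
    return len(payload) / best / 1e6


payload = make_payload(20_000)
rate = parse_throughput_mb_per_s(payload)
print(f"{len(payload) / 1e6:.1f} MB parsed at about {rate:.0f} MB/s")
```

Because the payload is already in memory, this measures parsing and object construction only, which is the part the thread is comparing against I/O.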

Yep. I think the fastest NVMe drives were a tad above 3 GB/s, and network IO is certainly faster. It always irks me when something that stopped being true 15-20 years ago is cited today.