by marknadal 2269 days ago

Holy Cow! 2.5GB/s that is amazing.

Meanwhile I can barely get Chrome/NodeJS to parse 20MB in less than 100ms :(.

How useful (or useless) would Simdjson as a Native Addon to V8 be? I assume transferring the object into JS land would kill all the speed gains?

I wrote my own JSON parser just last week, to see if I could improve the NodeJS situation. Discovered some really interesting factoids:

(A) JSON parse is CPU-blocking, so if you get a large object, your server cannot handle any other web request until it finishes parsing. This sucks.

(B) At first I fixed this by using setImmediate/shim, but discovered two annoying issues:

(1) Scheduling too many setImmediates will cause the event loop to block at the "check" cycle, you actually have to load balance across turns in the event loop like so (https://twitter.com/marknadal/status/1242476619752591360)

(2) Doing the above will make your code way slower, so a trick instead is to skip setImmediate and invoke your code 3333 times (some divisor of NodeJS's ~11K stack depth limit) or for 1ms before doing a real setImmediate.

(C) Now that we can parse without blocking, our parser's while loop (https://github.com/amark/gun/blob/master/lib/yson.js) marches X byte increments at a time (I found 32KB to be a sweet spot, not sure why).

(D) I'm seeing this pure JS parser be ~2.5X slower than native for big complex JSON objects (20MB).

(E) Interestingly enough, I'm seeing 10X~20X faster than native, for parsing JSON records that have large values (ex, embedded image, etc.).

(F) Why? This happened when I switched my parser to skip per-byte checks when encountering `"` to next indexOf. So it would seem V8's built-in JSON parser is still checking every character for a token, which slows it down?

(G) I hate switch statements, but woah, I got a minor but noticeable speed boost going from if/else token checks to a switch statement.

Happy to answer any other Qs!

But compared to OP's 2.5GB/s parsing?! Ha, mine is a joke.
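
A minimal sketch of the trick in (B)(2), assuming the work can be expressed as a `step` function that returns `false` when finished. `runChunked`, `step`, `done`, `maxCalls`, and `maxMs` are hypothetical names for illustration, not taken from yson.js, and the sketch uses a plain loop rather than nested calls, so the stack depth limit does not come into play here:

```javascript
// Hypothetical helper: calls `step` repeatedly until it returns false.
// Each event-loop turn runs up to `maxCalls` steps or for about `maxMs`
// milliseconds, then yields with a single setImmediate.
function runChunked(step, done, opts = {}) {
  const maxCalls = opts.maxCalls ?? 3333; // per-turn call budget (figure from the comment)
  const maxMs = opts.maxMs ?? 1;          // per-turn time budget in milliseconds
  let turns = 0;

  function turn() {
    turns++;
    const start = Date.now();
    let calls = 0;
    while (calls < maxCalls && Date.now() - start < maxMs) {
      calls++;
      if (step() === false) {
        done(turns); // report how many event-loop turns were used
        return;
      }
    }
    setImmediate(turn); // budget spent: let other queued work run first
  }

  turn();
}

// Usage: count to one million without monopolizing the event loop.
let counter = 0;
runChunked(
  () => ++counter < 1e6,
  (turns) => console.log(`finished after ${turns} turn(s)`)
);
```

Only one setImmediate is scheduled per turn, which avoids flooding the "check" phase. `Date.now()` has millisecond resolution, so the 1ms budget is approximate.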

5 comments

bjoli 2269 days ago

I did a small benchmark on my machine last time simdjson was up for discussion, and back then it was faster than /bin/cat on my machine.

mianos 2269 days ago

This comment was right at the bottom. It was so funny I just spit my coffee.

bjoli 2268 days ago

The thing is, it really was faster than GNU cat. I suspect it is because GNU cat does other things than just using Linux splice to a file descriptor and has options to count lines and such, and doesn't (didn't?) bother to use SSE. I just thought cat would give me a practical maximum to compare to when reading from disk.

huhnmonster 2269 days ago

I've also written and tried to optimize a hand-rolled JSON parser for exchange messages, just to see how fast pure JS could go. I tried many different things, but I only ever got near the native implementation once I started assuming certain offsets in the buffer or optimistically parsing whole keys, both of which were highly unsafe. My verdict was that you will never really get close to native, let alone close to hand-optimized C/C++.

wingi 2269 days ago

The native parser is C++.

zbjornson 2269 days ago

The interchange into v8 is indeed an issue, see another comment: https://news.ycombinator.com/item?id=22745941.

> JSON parse is CPU-blocking, so if you get a large object, your server cannot handle any other web request until it finishes parsing

Well, your CPU core is busy on one request or another, so I don't understand why this is an issue as long as you're guarding against maliciously large bodies. Blocking I/O is different because your core is partially idle while other hardware is doing async work. Using Node.js' cluster module lets you keep more cores busy. Chunking CPU-limited work increases total CPU time and memory required. (This is a pet peeve of mine and a hill I'm willing to die on :-) .)

marknadal 2268 days ago

I think that is a good hill to die on, tho I would rather prioritize UX (browser not freezing) and server responsiveness. Ideally we'd have no CPU chunking & good UX, but if we have to choose one, which would you sacrifice?

imtringued 2269 days ago

There are third party bindings for nodejs https://github.com/luizperes/simdjson_nodejs. As you suspected, converting the entire document to a JS object is not recommended. [0] There is an additional API that allows you to query keys without conversion.

[0] https://github.com/luizperes/simdjson_nodejs/issues/5

luizperes 2268 days ago

Yes, that is correct. I spent a lot of time on issue #5 to make it as user-friendly as I could, but the only way I found to avoid all the C++/JS conversion overhead was to keep the pointer to the external C++-parsed object. There might be other options that I haven't thought of, so if anyone knows of a better approach, let me know.

ksherlock 2269 days ago

> ... Why? This happened when I switched my parser to skip per-byte checks when encountering `"` to next indexOf.

Q: What happens when you parse "\\" ?

marknadal 2269 days ago

If string[index-1] === `\\` Then skipAgain

ksherlock 2269 days ago

How does that differentiate "\\" vs "\" ?
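
One way to tell those cases apart, sketched under the assumption that the parser jumps to each candidate quote with indexOf, is to count the run of backslashes directly before the quote. An even count means the backslashes escape each other and the quote really closes the string; an odd count means the quote itself is escaped. Checking only `string[index-1]` cannot distinguish the two. `findClosingQuote` is a hypothetical helper for illustration, not necessarily what yson.js does:

```javascript
// Hypothetical helper: returns the index of the quote that closes a JSON
// string whose opening quote is at `open`, or -1 if it is unterminated.
function findClosingQuote(text, open) {
  let i = text.indexOf('"', open + 1);
  while (i !== -1) {
    // Count the run of backslashes immediately before this quote,
    // never walking back past the opening quote.
    let backslashes = 0;
    for (let j = i - 1; j > open && text[j] === "\\"; j--) {
      backslashes++;
    }
    // Even run: the backslashes pair up, so this quote ends the string.
    if (backslashes % 2 === 0) return i;
    // Odd run: the quote is escaped, so jump to the next candidate.
    i = text.indexOf('"', i + 1);
  }
  return -1;
}

// String.raw keeps the backslashes exactly as typed.
console.log(findClosingQuote(String.raw`"\\" tail`, 0));   // 3
console.log(findClosingQuote(String.raw`"a\"b" tail`, 0)); // 5
```

The backward scan only runs at candidate quotes, so long values without quotes are still skipped in a single indexOf call.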
