by stingraycharles 1644 days ago
Even when memory bandwidth is the bottleneck (which I’m not sure about), optimizations help.

Because modern CPUs overlap computation with outstanding memory accesses (out-of-order execution, hyperthreading), a core that is waiting on a memory read for one function can keep doing other work in the meantime.

As such, if this base64 implementation is more efficient, then even if the “wall clock” time is identical because memory bandwidth is the limit, the CPU has more time left to perform _other_ tasks.

1 comment

Memory bandwidth is a complex beast. One should be able to get 50 GB/s for short decodes of single- to double-digit KB on modern hardware. The author measured roughly 11 GB/s in their memcpy benchmark, but only half that for decode. If memory bandwidth were the wall, decode should be much closer to memcpy in performance. I could see fusing base64 decode and JSON parsing into a single function.
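For anyone who wants to try the memcpy-versus-decode comparison themselves, here is a rough sketch of the method. It uses Python's standard-library `base64.b64decode`, not the implementation from the article, so the absolute numbers will not match the author's. The helper names `throughput_gb_per_s` and `compare` and the 48 KB buffer size are my own assumptions for illustration. The only point is that if decode lands well below a plain copy of the same buffer, memory bandwidth is probably not the limit.

```python
import base64
import os
import time


def throughput_gb_per_s(fn, payload, output_bytes, inner=100, repeats=20):
    """Best-of-N throughput of fn(payload), in GB/s of output produced.

    Hypothetical helper: times `inner` back-to-back calls so that the
    measured interval is comfortably above the timer resolution.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(inner):
            fn(payload)
        elapsed = (time.perf_counter() - start) / inner
        best = min(best, elapsed)
    return output_bytes / best / 1e9


def compare(decoded_size=48 * 1024):
    """Return (copy_rate, decode_rate) in GB/s for a buffer of decoded_size bytes.

    48 KB is an assumed size, picked to sit in the "double digit KB" range
    so the data should stay in cache on most machines.
    """
    raw = os.urandom(decoded_size)
    encoded = base64.b64encode(raw)
    # bytearray(raw) forces a real copy; bytes(raw) would just return raw.
    copy_rate = throughput_gb_per_s(bytearray, raw, len(raw))
    # Both rates are normalized to decoded bytes written.
    decode_rate = throughput_gb_per_s(base64.b64decode, encoded, len(raw))
    return copy_rate, decode_rate


if __name__ == "__main__":
    copy_rate, decode_rate = compare()
    print(f"copy:   {copy_rate:.2f} GB/s")
    print(f"decode: {decode_rate:.2f} GB/s")
```

Both rates are normalized to decoded bytes written, so the two numbers can be compared directly. Interpreter call overhead is included in both, which matters more at small buffer sizes.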

It would be cool to show a demo of a processor maintaining good IPC on one hyperthread while the other hyperthread on the same core does a base64 decode.