I want to say this looks interesting but it's so sparse on details it's hard to say. What modifications were made to base64? Is the only modification replacing "+" with "."? Is the rest of the table[1] unchanged? Can it be extended to only use URL-safe characters (like base64url)? How does this compare to other popular/optimized implementations of standard base64? What are its performance characteristics on different sizes of inputs? I've never had to decode 100 MB of base64 data, but I have had to decode a ton of 64-bit base64 strings, for example.
The rest is unchanged, but there is a trick in that the table indexes are shuffled, and shuffled differently depending on the neighboring values in a 24-bit block of source data.
> but I have had to decode a ton of 64-bit base64 strings, for example
Should be good for small data. It's a pity that no one (or I haven't seen such benchmarks) measures the performance of decoding a large amount of small data (like 1..64 bytes).
It would have been better to have an algorithm description somewhere, but it is not hard to follow. So crzy64 is essentially base64 plus two optimizations:
- Output bytes [A-Za-z0-9./] are shuffled so that it can be efficiently translated from and to the 0..63 range. The exact remapping is as follows:
- Bits are reordered and XORed with each other to make the unpacking from three 8-bit bytes to four 6-bit units extremely simple: specifically, `x ^ (x >> 6)`. The input bits `qrstuvwx ijklmnop abcdefgh` thus should be converted to the following four 6-bit units, XORed together (here represented with two-bit padding as in the actual implementation):
Even when memory bandwidth is the bottleneck (which I’m not sure about), optimizations help.
Due to the nature of CPUs operating asynchronously, if one function is waiting for memory to be read, it can continue doing other things.
As such, if this base64 implementation is more efficient, even though the “wall clock” time is exactly identical due to the memory bandwidth, the CPU has more time to perform _other_ tasks.
Memory bandwidth is a complex beast, one should be able to get 50GB/s for short decodes of single to double digit k on modern hardware. The author measured 11GB/s ish in their memcpy benchmark, but only half that for decode. If memory bandwidth was the wall, then it should be closer to memcpy in perf. I could see fusing base64 decode and json parsing into a single function.
It would be cool to show a demo of a processor maintaining good IPC on one hyper thread while the other hyper thread running on the same core was able to do a base64 decode.
Memory bandwidth is the limit, so you can't make it faster than memcpy(). However, copying memory is still faster than these algorithms, аnd it's complicated to make the conversion speed closer to the speed of copying memory.
Only if you don't stream it to any more compute intensive process, so you'd need to leave cycles free for that, and you are reading it from a cache cold buffer instead of eg a network buffer which would be already in cache coming out of the net stack.
I know that. My approach is different: instead of finding the fastest solution to deal with the existing base64 standard, I modify the standard to make it easier to decode. (There are cases when it isn't necessary to follow the standard.)
[1]: https://en.wikipedia.org/wiki/Base64#Base64_table