Hacker News new | ask | show | jobs
Show HN: "crzy64", base64 mod aimed for fastest decoding (github.com)
47 points by jpegqs 1644 days ago
7 comments

I want to say this looks interesting but it's so sparse on details it's hard to say. What modifications were made to base64? Is the only modification replacing "+" with "."? Is the rest of the table[1] unchanged? Can it be extended to only use URL-safe characters (like base64url)? How does this compare to other popular/optimized implementations of standard base64? What are its performance characteristics on different sizes of inputs? I've never had to decode 100 MB of base64 data, but I have had to decode a ton of 64-bit base64 strings, for example.

[1]: https://en.wikipedia.org/wiki/Base64#Base64_table

> Is the rest of the table[1] unchanged?

The rest is unchanged, but there is a trick in that the table indexes are shuffled, and shuffled differently depending on the neighboring values in a 24-bit block of source data.

> but I have had to decode a ton of 64-bit base64 strings, for example

Should be good for small data. It's a pity that no one (or I haven't seen such benchmarks) measures the performance of decoding a large amount of small data (like 1..64 bytes).

It would have been better to have an algorithm description somewhere, but it is not hard to follow. So crzy64 is essentially base64 plus two optimizations:

- Output bytes [A-Za-z0-9./] are shuffled so that it can be efficiently translated from and to the 0..63 range. The exact remapping is as follows:

    cDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz./0123456789AaBbC
- Bits are reordered and XORed with each other to make the unpacking from three 8-bit bytes to four 6-bit units extremely simple: specifically, `x ^ (x >> 6)`. The input bits `qrstuvwx ijklmnop abcdefgh` thus should be converted to the following four 6-bit units, XORed together (here represented with two-bit padding as in the actual implementation):

    00abcdef 00ijklgh 00qrmnop 00stuvwx
    0000ijkl 00000000 0000gh00 00mnop00
    000000qr 00000000 00000000 00gh0000
Since both optimizations can equally work well for SIMD and for SWAR (SIMD within a register), I guess they may even be useful for small inputs.
Great description, but there is a typo in the second row:

    00000000 abcdefgh ijklmnop qrstuvwx
    -------- -------- -------- --------
    00abcdef 00ijklgh 00qrmnop 00stuvwx
    0000ijkl 0000qr00 0000gh00 00mnop00 <-- missing "qr"
    000000qr 00000000 00000000 00gh0000
Still can't improve this stage of encoding, it requires a lot of operations. Maybe there is something trivial, but I don't see it.
Isn't base64 encoding/decoding bound to memory bandwidth? If so, won't all 4/3 memory formats encode/decode at the same speed?
Even when memory bandwidth is the bottleneck (which I’m not sure about), optimizations help.

Due to the nature of CPUs operating asynchronously, if one function is waiting for memory to be read, it can continue doing other things.

As such, if this base64 implementation is more efficient, even though the “wall clock” time is exactly identical due to the memory bandwidth, the CPU has more time to perform _other_ tasks.

Memory bandwidth is a complex beast, one should be able to get 50GB/s for short decodes of single to double digit k on modern hardware. The author measured 11GB/s ish in their memcpy benchmark, but only half that for decode. If memory bandwidth was the wall, then it should be closer to memcpy in perf. I could see fusing base64 decode and json parsing into a single function.

It would be cool to show a demo of a processor maintaining good IPC on one hyper thread while the other hyper thread running on the same core was able to do a base64 decode.

Memory bandwidth is the limit, so you can't make it faster than memcpy(). However, copying memory is still faster than these algorithms, аnd it's complicated to make the conversion speed closer to the speed of copying memory.
What you're saying is that memory bandwidth is a limit but not the current limiting factor. So your answer to GP is "no".
Only if you don't stream it to any more compute intensive process, so you'd need to leave cycles free for that, and you are reading it from a cache cold buffer instead of eg a network buffer which would be already in cache coming out of the net stack.
There has been work done one optimizing regular base64 using SSE:

http://www.alfredklomp.com/programming/sse-base64/

I know that. My approach is different: instead of finding the fastest solution to deal with the existing base64 standard, I modify the standard to make it easier to decode. (There are cases when it isn't necessary to follow the standard.)
> There is a difference with base64 as it uses "./" instead of "+/" and the data is also pre-shuffled to speed up decoding.

You know what else uses dot slash? The Unix crypt function.

Can you tell us what set of 64 characters it encodes to?
Going off the readme, it appears to use "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789./" though not necessarily in that order.
I wonder if they just reinvented UU-(en,de)code: https://en.wikipedia.org/wiki/Uuencoding

I guess it'll go... nowhere.

There's been prior art, I was immediately reminded of Ascii85

https://en.wikipedia.org/wiki/Binary-to-text_encoding#Encodi...

Incidentally every Windows desktop now has UUENCODE built in

  C:\>dir > %TEMP%\foo && tar -cf - --format shardump %TEMP%\foo
Finally I can extract those usenet archives. :)