Hacker News
by Aardwolf 940 days ago
A thing I wonder: why is using = padding required in the most common base64 variant?

It's redundant since this info can be fully inferred from the length of the stream.

Even for concatenations it is not necessary to require it, since you must still know the length of each sub stream (and = does not always appear so is not a separator).

Using the = instead of per-byte length checking cannot gain any speed: to prevent reading out of bounds you must check the length per byte anyway, because you can't trust the input length to be a multiple of 4.

It could only make sense if a platform were somehow required to read 4 bytes at once and could not possibly read fewer, but what platform is like that?
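
To make the "inferred from the length" point concrete, here is a minimal sketch in Python. The helper name `decode_unpadded` is made up for this illustration; it assumes the standard alphabet and simply restores the = characters from the length before handing the text to the standard library decoder.

```python
import base64


def decode_unpadded(text: str) -> bytes:
    """Decode standard-alphabet base64 that may be missing its '=' padding.

    Hypothetical helper for illustration, not part of any library.
    """
    stripped = text.rstrip("=")
    if len(stripped) % 4 == 1:
        # A single leftover character carries only 6 bits, less than one byte.
        raise ValueError("invalid base64 length")
    # The pad count follows from the length alone: 0, 1 or 2 characters.
    padded = stripped + "=" * (-len(stripped) % 4)
    return base64.b64decode(padded, validate=True)


print(decode_unpadded("aGVsbG8"))   # b'hello'
print(decode_unpadded("aGVsbG8="))  # b'hello'
```

The only length that cannot be repaired is one with a remainder of 1 modulo 4, which is invalid with or without padding.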

5 comments

from Wikipedia:

  The padding character is not essential for decoding, since the number of missing bytes can be inferred from the length of the encoded text. In some implementations, the padding character is mandatory, while for others it is not used. An exception in which padding characters are required is when multiple Base64 encoded files have been concatenated.
IMO padding is not necessary and just a relic of old implementations.
I think so too. It feels similar to how many specifications from the 90s use big endian 4-byte integers for many things (like png, riff, jpeg, ...) despite little endian CPUs having been the most common since the 80s already. Those specifications seemingly assume that you would want to decode those 4-byte values with fread, without any bounds checking or endianness handling.
Without padding, how would you encode, for example, a message with just a single zero? To be more precise, how do you distinguish it from two zeroes and three zeroes?
Both for encoding and decoding the padding is not needed. Without ='s, you get a uniquely different base64 encoding for NULL, 2 NULLs and 3 NULLs.

This shows the binary, base64 without padding and base64 with padding:

NULL --> AA --> AA==

NULL NULL --> AAA --> AAA=

NULL NULL NULL --> AAAA --> AAAA

As you can see, all the padding does is make the base64 length a multiple of 4. You already get uniquely distinguishable output for the 3 cases (one, two or three NULL bytes) without the ='s, so they are unnecessary.
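
The three cases above can be reproduced with Python's standard `base64` module. This is only a small check; the function name `zero_encodings` is an assumption made for the example.

```python
import base64


def zero_encodings(count: int) -> tuple[str, str]:
    """Return (unpadded, padded) base64 for `count` zero bytes."""
    padded = base64.b64encode(b"\x00" * count).decode("ascii")
    return padded.rstrip("="), padded


for count in (1, 2, 3):
    unpadded, padded = zero_encodings(count)
    print(count, unpadded, padded)
# 1 AA AA==
# 2 AAA AAA=
# 3 AAAA AAAA
```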

Oh right. Problems only show up when you concatenate two messages, because a single null is AA, but two nulls are AAA, not AAAA.
The output padding (the = characters) is only relevant for decoding. For encoding, since each Base64 character carries 6 bits, the input is filled up with zero bits when its bit length is not a multiple of 6 (e.g. encoding two bytes (16 bits) needs two more bits to become a multiple of 6 (18)).

Refer to the "examples" section of the Wikipedia page.
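
The bit arithmetic can be sketched with a tiny hand-written encoder. This is an illustration only, assuming the standard alphabet; `encode_unpadded` and `encode_padded` are hypothetical names, and real code should use the `base64` module instead.

```python
import string

# Standard base64 alphabet: A-Z, a-z, 0-9, '+', '/'.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def encode_unpadded(data: bytes) -> str:
    """Encode bytes as base64 without '=' padding (illustrative, not fast)."""
    bits = "".join(f"{byte:08b}" for byte in data)
    # Fill the last group with zero bits so the bit count is a multiple of 6.
    bits += "0" * (-len(bits) % 6)
    return "".join(ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6))


def encode_padded(data: bytes) -> str:
    """Same encoding, then '=' appended until the length is a multiple of 4."""
    text = encode_unpadded(data)
    return text + "=" * (-len(text) % 4)


print(encode_unpadded(b"hi"))  # aGk   (16 bits + 2 zero bits = 18 bits = 3 chars)
print(encode_padded(b"hi"))    # aGk=
```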

Perhaps to simplify implementations that read multiple characters at a time?

But I think it's likely just poor design taste.

> Even for concatenations it is not necessary to require it, since you must still know the length of each sub stream

I'm not sure I understand this part. You can decode aGVsbG8=IHdvcmxk as it is; what else do you need to know?

The = does not appear if the length of the base64 data is already a multiple of 4. So you wouldn't know if aGVsbG8I is one or two streams. The = is not a separator, only padding to make the length of the base64 stream a multiple of 4 for some reason.

I only mentioned the concatenation because Wikipedia claims this use case requires padding while in reality it doesn't.

Base64 doesn't have a concept of a "stream". Conceptually, a base64-encoded string with padding is a concatenation of fragments that are always 4 characters long but can each encode one to three bytes. Concatenating two base64-encoded strings with padding therefore doesn't destroy the fragment structure, and the result can be decoded into a byte sequence that is the concatenation of the two original input sequences. Without padding, fragments can also be 2 or 3 characters long, and short fragments are not distinguishable from long ones, so concatenation destroys the fragment structure.
Oh I see, so it's for concatenating multiple base64 fragments of the same single piece of data? But where is this used? I've never seen that. JavaScript's base64 decoder gives an error for ='s in the middle (but I just found out the Linux base64 -d command supports it!)
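
A rough sketch of that fragment view in Python, assuming the standard alphabet. The helper `decode_fragments` is hypothetical and decodes each 4-character fragment on its own, so padding in the middle of the text is not a problem. The last lines show the unpadded ambiguity using the zero-byte example from earlier in the thread.

```python
import base64


def decode_fragments(text: str) -> bytes:
    """Decode padded base64 one 4-character fragment at a time."""
    if len(text) % 4:
        raise ValueError("padded base64 must be a multiple of 4 characters")
    return b"".join(
        base64.b64decode(text[i:i + 4], validate=True)
        for i in range(0, len(text), 4)
    )


joined = base64.b64encode(b"hello").decode() + base64.b64encode(b" world").decode()
print(joined)                    # aGVsbG8=IHdvcmxk
print(decode_fragments(joined))  # b'hello world'

# Without padding, two one-byte messages look exactly like one three-byte message.
one = base64.b64encode(b"\x00").decode().rstrip("=")            # AA
three = base64.b64encode(b"\x00\x00\x00").decode().rstrip("=")  # AAAA
print(one + one == three)        # True
```

Decoding fragment by fragment avoids depending on how a particular library treats = in the middle of its input, which differs between decoders.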
I actually don't know whether this was the intention, but it is the only explanation that makes sense. It should be noted that the original PEM specification (RFC 989) did have a similar use case, where alternating encrypted and unencrypted portions are separated by `*` characters, but you are still required to pad each portion to a multiple of 4 characters (e.g. `c2VjcmV0LCA=*cHVibGlj*IGFuZCBzZWNyZXQgYWdhaW4=`). It is still the closest thing to what I think padding characters are required for.
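
For illustration, the example string above can be taken apart by splitting on `*` and decoding each portion independently. This is just a sketch of the idea, not an RFC 989 parser; `decode_portions` is a made-up helper name.

```python
import base64


def decode_portions(text: str, separator: str = "*") -> list[bytes]:
    """Split on the separator and decode each padded portion independently."""
    return [base64.b64decode(part, validate=True) for part in text.split(separator)]


sample = "c2VjcmV0LCA=*cHVibGlj*IGFuZCBzZWNyZXQgYWdhaW4="
print(decode_portions(sample))
# [b'secret, ', b'public', b' and secret again']
```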
It would decode correctly but you wouldn't know the boundary, if that matters. I see, thanks.