by jjeaff 940 days ago
and yet, Internet protocols (http, at least) don't play well with equal signs which are part of base64, sometimes. That little issue has caused lots of intermittent bugs for me over the years, either from forgetting to urlencode it or not urldecoding it at the right time.
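A minimal sketch of that pitfall, using only Python's standard library. The parameter name `t` and the two sample bytes are made up for illustration; the bytes are picked so the encoded token contains all three troublesome characters.

```python
import base64
from urllib.parse import parse_qs, quote, unquote, urlencode

# Two bytes chosen so the standard encoding contains '+', '/' and '='.
token = base64.b64encode(b"\xfb\xff").decode("ascii")  # "+/8="

# Pasted into a query string unescaped, the '+' is read back as a space.
broken = parse_qs("t=" + token)["t"][0]

# Percent-encoding every reserved character makes the round trip lossless.
escaped = quote(token, safe="")
query = urlencode({"t": token})
restored = parse_qs(query)["t"][0]

print(repr(token), repr(broken), escaped, query, repr(restored))
```

Forgetting the `quote` step, or applying `unquote` twice, is the kind of intermittent bug described above: it only shows up when the token happens to contain one of those characters.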
6 comments

So there are 7 base64 encodings: one with “+ / =”, one with “- _ =”, one with “+ ,” and no “=”… https://en.wikipedia.org/wiki/Base64#Variants_summary_table
And decoders typically aren't interoperable, requiring you to use the specific decoder for that combination.
Which is silly, because there’s no good reason to, except for strict validation.
TIL.

And Python uses RFC 4648

Python might say that, but as is often the case it’s not really true: it mostly works off of RFC 2045:

- the “default” encoder (“b64encode”) will pad the output

- although it will not add line breaks (“encodebytes” does that)

- the default decoder will error if the input is not padded

- the default decoder will ignore all non-encoding characters by default

Also both b64encode and encodebytes actually use binascii.b2a_base64, which claims conformance to RFC 3548, which attempts to unify RFC 1421 and RFC 2045. Except RFC 3548 requires rejecting non-encoding data, whereas (again) Python accepts and ignores it by default, in RFC 2045 fashion.
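For anyone who wants to check those four points, here is a small sketch against the standard `base64` module. The sample inputs are arbitrary; `validate=True` is the documented opt-in for strict rejection.

```python
import base64
import binascii

# 1. The default encoder pads its output.
padded = base64.b64encode(b"a")  # b"YQ=="

# 2. b64encode never inserts line breaks; encodebytes does.
one_line = base64.b64encode(b"x" * 100)
wrapped = base64.encodebytes(b"x" * 100)

# 3. The default decoder raises on missing padding.
try:
    base64.b64decode(b"YQ")
    unpadded_ok = True
except binascii.Error:
    unpadded_ok = False

# 4. The default decoder silently skips characters outside the alphabet...
lenient = base64.b64decode(b"aGVs!\nbG8=")  # b"hello"

# ...unless strict validation is requested explicitly.
try:
    base64.b64decode(b"aGVs!\nbG8=", validate=True)
    strict_ok = True
except binascii.Error:
    strict_ok = False

print(padded, unpadded_ok, lenient, strict_ok)
```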

And slashes as well, which is a magic character in both urls and file systems. Means you can't reliably use normal base64 for filenames, for instance. That might seem like a niche use-case, but it's really not, because you can use it for content-based addressing. Git does this, names all the blobs in the .git folder after their hash, but you can't encode the hash with regular base64.
There’s the URL- and filename-safe variant of Base64 [0]. Decoders can support it simultaneously and transparently.

[0] https://www.rfc-editor.org/rfc/rfc4648.html#section-5
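A rough sketch of what "simultaneously and transparently" could look like in Python. `lenient_b64decode` is a hypothetical helper, not a standard library function; it assumes the input uses either the standard or the URL-safe alphabet, with or without padding.

```python
import base64

# Map the URL-safe characters back onto the standard alphabet.
_URLSAFE_TO_STANDARD = str.maketrans("-_", "+/")


def lenient_b64decode(text: str) -> bytes:
    """Decode standard or URL-safe base64, padded or not (hypothetical helper)."""
    cleaned = "".join(text.split())      # drop any whitespace
    cleaned = cleaned.rstrip("=")        # normalise: strip existing padding
    cleaned = cleaned.translate(_URLSAFE_TO_STANDARD)
    cleaned += "=" * (-len(cleaned) % 4)  # restore the padding Python expects
    # validate=True still rejects anything outside the alphabet.
    return base64.b64decode(cleaned, validate=True)


print(lenient_b64decode("+/8="), lenient_b64decode("-_8"), lenient_b64decode("YQ"))
```

This works because the two alphabets only differ in two characters, so no input is ambiguous between them.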

You can also manually replace them with the URL-safe characters.
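On the encoding side, the manual replacement is a two-character swap. A small sketch, using a SHA-256 digest of an arbitrary sample string as a stand-in for a content hash (the input and the idea of using it as a filename are illustrative assumptions):

```python
import base64
import hashlib

# Arbitrary sample content; any bytes would do.
digest = hashlib.sha256(b"hello world\n").digest()

standard = base64.b64encode(digest).decode("ascii")

# Manual swap of the two problem characters...
manual = standard.replace("+", "-").replace("/", "_")

# ...gives the same result as the built-in URL-safe encoder.
urlsafe = base64.urlsafe_b64encode(digest).decode("ascii")

# Dropping the padding leaves a name with no '/', '+' or '='.
filename = urlsafe.rstrip("=")

print(manual == urlsafe, filename)
```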
Ditto the obnoxious "quoted-printable" mail encoding, which turns every = into =3D.

Still more robust than uuencode though.

It's basically the same as URL encoding, they just picked = instead of %
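Side by side in Python, with the standard `quopri` and `urllib.parse` modules (the sample string is arbitrary):

```python
import quopri
from urllib.parse import quote

sample = "a=b"

# Quoted-printable escapes '=' with '=' followed by two hex digits.
qp = quopri.encodestring(sample.encode("ascii"))  # b"a=3Db"

# Percent-encoding does the same thing with '%' as the escape character.
pct = quote(sample, safe="")  # "a%3Db"

print(qp, pct)
```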
It is, plus extra segmenting with `=` escaped line breaks [1]:

> Lines of Quoted-Printable encoded data must not be longer than 76 characters. To satisfy this requirement without altering the encoded text, soft line breaks may be added as desired. A soft line break consists of an =

IIUC in Base64 you can throw whichever white space anywhere and it should be ignored. And in URL ("percent") encoding there is no insignificant white space possible (?) and encoding of white space depends on implementation (dreaded space `%20` vs ` ` vs `+` in application/x-www-form-urlencoded [2]).

[1] https://en.wikipedia.org/wiki/Quoted-printable [2] https://en.wikipedia.org/wiki/Percent-encoding
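A quick sketch of the three behaviours mentioned here, as Python's standard library implements them. Other implementations may differ, which is rather the point; the sample inputs are arbitrary.

```python
import base64
import quopri
from urllib.parse import quote, quote_plus, unquote, unquote_plus, urlencode

# Quoted-printable: long lines are split with soft line breaks ("=" + newline),
# which disappear again on decoding.
long_line = b"x" * 100
qp = quopri.encodestring(long_line)
qp_lines = qp.split(b"\n")
qp_roundtrip = quopri.decodestring(qp)

# Base64: Python's default decoder skips embedded whitespace.
b64 = base64.b64decode(b"aGVs\n bG8=")  # b"hello"

# Percent-encoding: how a space is written depends on which function you call.
as_percent = quote("a b")            # "a%20b"
as_plus = quote_plus("a b")          # "a+b"
as_form = urlencode({"q": "a b"})    # "q=a+b"

# And decoding has to match: plain unquote leaves '+' alone.
plus_kept = unquote("a+b")           # "a+b"
plus_as_space = unquote_plus("a+b")  # "a b"

print(len(qp_lines), b64, as_percent, as_plus, as_form, plus_kept, plus_as_space)
```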

I am using base62 for data that can be included in URIs.
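There is no single standard base62, so the following is only one possible sketch: the alphabet order (digits, then uppercase, then lowercase) and the handling of leading zero bytes are assumptions, and `base62_encode` / `base62_decode` are hypothetical helpers. It treats the bytes as one big integer, which is simple but slow for large inputs.

```python
import string
from urllib.parse import quote

# Assumed alphabet order; other base62 implementations may differ.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase


def base62_encode(data: bytes) -> str:
    # Leading zero bytes would vanish in the integer, so keep them as '0' digits.
    zeros = len(data) - len(data.lstrip(b"\x00"))
    n = int.from_bytes(data, "big")
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return ALPHABET[0] * zeros + "".join(reversed(digits))


def base62_decode(text: str) -> bytes:
    zeros = len(text) - len(text.lstrip(ALPHABET[0]))
    n = 0
    for ch in text:
        n = n * 62 + ALPHABET.index(ch)  # ValueError on characters outside the alphabet
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return b"\x00" * zeros + body


encoded = base62_encode(b"\xfb\xff")
# Only letters and digits, so percent-encoding leaves it unchanged.
print(encoded, quote(encoded, safe="") == encoded)
```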
all three symbols are some of the worst possible choices for compatibility with urls and many other things

.-_ would have been a better choice than +/=

base64 is older than URLs, though.
And now we can have whitespace in url queries but we are still using %20 everywhere because "that's standard"...
Try copy-pasting a link that has actual whitespace in its URL queries and see if it gets linkified correctly. Just because you can doesn't mean you should! A space is like the one delimiter that is applicable for separating out URLs from the context of a larger blob of text.
Browsers will often display %20 as a space, but that's not the same thing as spaces being legal within URLs.
You are right. It seems Firefox displays %20 as whitespace and converts whitespace to %20 when you use it. Chrome displays it as %20 but still converts whitespace to %20 if you try to use it.
Space is not legal at the HTTP request level, because the opening line uses space as a delimiter like:

    GET /your/path-to/the.file HTTP/1.1
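To illustrate why, here is a deliberately naive request-line parser. `parse_request_line` is a hypothetical helper; real servers are more careful and vary in how strict they are, but the space-delimited shape is the same.

```python
from urllib.parse import quote


def parse_request_line(line: str) -> tuple[str, str, str]:
    """Split an HTTP/1.1 request line into method, target and version (naive sketch)."""
    parts = line.split(" ")
    if len(parts) != 3:
        raise ValueError(f"expected 3 space-separated fields, got {len(parts)}")
    method, target, version = parts
    return method, target, version


# The example from above parses cleanly.
ok = parse_request_line("GET /your/path-to/the.file HTTP/1.1")

# A raw space in the path produces a fourth field, so the line is rejected.
try:
    parse_request_line("GET /your/path to/the.file HTTP/1.1")
    raw_space_ok = True
except ValueError:
    raw_space_ok = False

# Percent-encoding the path first keeps the line at three fields.
encoded = parse_request_line("GET " + quote("/your/path to/the.file") + " HTTP/1.1")

print(ok, raw_space_ok, encoded)
```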
Have fun with newline and spaces/tabs conversions when allowing whitespace in URLs.