by Uptrenda 911 days ago
A few other bases that are interesting:

Base36: 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ

A good encoding for binary data in textual contexts, such as where parameter inputs or database fields are constrained and only accept certain characters. The lack of spaces means that it can be used on the command-line easily. Example use: IRC channel names.
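A minimal sketch of how that could look in Python, treating the bytes as one big integer and writing it out with the alphabet above. The function names are made up for illustration, and the explicit `length` argument on decode is an assumption to restore leading zero bytes, which the integer conversion otherwise drops.

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_base36(data: bytes) -> str:
    """Encode bytes as a base36 string (big-endian integer view)."""
    n = int.from_bytes(data, "big")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 36)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

def decode_base36(text: str, length: int) -> bytes:
    """Decode a base36 string back into exactly `length` bytes."""
    # int(..., 36) accepts both upper- and lower-case letters.
    return int(text, 36).to_bytes(length, "big")

encoded = encode_base36(b"hello")
assert decode_base36(encoded, 5) == b"hello"
```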

Base64: 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

Same as above but it adds lower-case alphabet characters. This is important because as you restrict the set of characters allowed in the alphabet, the length of the encoded string goes up. With more characters the coding is more efficient. Example use: YouTube video ids.
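To put rough numbers on the efficiency point, here is a small sketch that computes how many characters are needed to represent a given number of bytes in a given base. It assumes the data is encoded as a single big integer (no padding or chunking), and `encoded_length` is a hypothetical helper name.

```python
def encoded_length(num_bytes: int, base: int) -> int:
    """Smallest number of base-`base` digits that can hold any value of `num_bytes` bytes."""
    limit = 1 << (8 * num_bytes)   # number of distinct values to represent
    length, capacity = 0, 1
    while capacity < limit:
        capacity *= base
        length += 1
    return length

for base in (16, 36, 62, 64, 92):
    print(base, encoded_length(16, base))
```

For a 16-byte value (a UUID, say) this gives 32 characters in base 16, 25 in base 36, 22 in base 62 and base 64, and 20 in base 92.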

Base92: "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz~!@#$%^&*()_+{}\"<>?`-=[];',./|"

Base92 is every character you can make on a standard keyboard (I've replaced space with pipe here). It includes many characters that have special meanings on the command-line or may be used as delimiters in text-based protocols. So while this offers a more 'efficient' encoding scheme for binary data, it may break in some contexts. It's best where the input allows for typical formatting. Example use: forum / chat messages.

BaseN encoding schemes are interesting because they allow you to use standard text-fields in many systems to store binary data. The most well-known here is base64 which allows browsers to embed whole files as text and store them directly in the HTML. Some sites use these for optimization hacks.
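The embedding trick is usually done with a `data:` URI. A minimal sketch using Python's standard `base64` module; `to_data_uri` is a hypothetical helper and the MIME type is whatever fits the file.

```python
import base64

def to_data_uri(data: bytes, mime_type: str) -> str:
    """Wrap raw bytes in a base64 data URI that can be placed in an HTML attribute."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

# A tiny text payload; an image would work the same way with e.g. "image/png".
uri = to_data_uri(b"hi", "text/plain")
print(f'<a href="{uri}">inline file</a>')
```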


That is not base64, it's base62. You can tell because it only has 62 symbols. To get base64 you have to add 2 symbols that you arbitrarily select from the master "table of symbols to add to base62 to get to base64 depending on what the platform is and what characters are restricted in it" [1]. For instance you might use `@`, except in an email. Or `/`, but not in an fs path or URL.

As for base92, those symbols might all be easy to enter on your keyboard, but on international layouts the process can be quite involved indeed.

I prefer base36 for this reason. Want a compact random string? Math.random().toString(36).slice(2) (the slice drops the leading "0."). Remember to prefix it with a letter for settings that disallow leading digits, though! (variable identifiers, CSS class names, etc.)
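For comparison, a rough Python take on the same idea. This is a sketch, not a port: it swaps in the `secrets` module instead of a plain random float, and the `random_id` name, default length, and default prefix are all assumptions for illustration.

```python
import secrets
import string

# Lower-case base36 alphabet, matching what toString(36) emits in JavaScript.
ALPHABET = string.digits + string.ascii_lowercase

def random_id(length: int = 10, prefix: str = "x") -> str:
    """Random base36 string with a letter prefix so it never starts with a digit."""
    body = "".join(secrets.choice(ALPHABET) for _ in range(length))
    return prefix + body

print(random_id())  # usable as a CSS class name or identifier
```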

[1] https://en.wikipedia.org/wiki/Base64#Variants_summary_table

Base62 is fantastic for URL-friendly encoding. I use GUIDs for primary keys in my web app, and encode them for frontend consumption using Base62. Looks much neater and doesn't cause issues like Base64 extra characters might.
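A sketch of what that GUID-to-Base62 step might look like in Python. It assumes the digits/upper/lower alphabet order quoted in the post and a fixed width of 22 characters; the helper names are hypothetical and other implementations may order the alphabet differently.

```python
import string
import uuid

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase
WIDTH = 22  # 62**22 > 2**128, so 22 characters always suffice for a UUID

def uuid_to_base62(value: uuid.UUID) -> str:
    """Encode a UUID as a fixed-width, URL-friendly base62 string."""
    n = value.int
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits)).rjust(WIDTH, ALPHABET[0])

def base62_to_uuid(text: str) -> uuid.UUID:
    """Decode a base62 string produced by uuid_to_base62."""
    n = 0
    for ch in text:
        n = n * 62 + ALPHABET.index(ch)
    return uuid.UUID(int=n)

key = uuid.uuid4()
assert base62_to_uuid(uuid_to_base62(key)) == key
```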
>Base64: 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

This is only 62 characters!

Suspect they mistyped `Base62` (since they seem to be favouring non powers of two)
Base64 uses / and +
Sometimes. Other times those chars are not allowed in the embedding context (paths, for instance), so you have to use '+' and ','. Or maybe '_' and '-'.
If you follow RFC 4648, those are "base 64" but not "base64":

> This encoding [using '-' and '_'] may be referred to as "base64url"... Unless clarified otherwise, "base64" refers to the base 64 in the previous section ['+' and '/'].
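Python's standard library exposes both alphabets, which makes the difference easy to see. A small sketch; the three input bytes are chosen only because they happen to hit the two symbols that differ.

```python
import base64

# These bytes split into the 6-bit values 62, 63, 62, 63.
data = bytes([0xFB, 0xFF, 0xBF])

standard = base64.b64encode(data).decode("ascii")          # '+' and '/'
url_safe = base64.urlsafe_b64encode(data).decode("ascii")  # '-' and '_'

print(standard)  # +/+/
print(url_safe)  # -_-_
```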

> base36 textual contexts

Better IMO is base 32 with U (obscenity), 0/O (ambiguity), and I (ambiguity) removed.

Removing characters for obscenity is pointless (there are thousands of ways to evade this "filter"), English-centric and honestly a weird idea.

I've always heard that the reason is another ambiguity (u/v), which makes more sense to me.

Base64/Base32/ASCII is English-centric.

Might be weird to you personally, but there are literally government agencies to prevent obscenities.

What makes the letter U obscene?
You can make the word fuck with it. That upsets children on the internet.
I doubt that upsets any children on the internet; more likely it upsets some adults on behalf of children on the internet.
If that's what you're trying to avoid, it will be a lot more effective to remove F.
Might as well go for Base27 then. Strip out all of the vowels and you can't accidentally make naughty words any more.
That's Crockford Base32, not RFC Base32

https://en.m.wikipedia.org/wiki/Base32

Crockford is a bit different, and normalizes I/1/O/0 on parsing.
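A rough sketch of that parse-time normalization in Python. The alphabet is Crockford's (no I, L, O, U); per his spec, decoding is case-insensitive, O is read as 0, I and L are read as 1, and hyphens are ignored. The function name is hypothetical and check symbols are left out.

```python
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

# Map the look-alike letters onto digits and drop readability hyphens.
_NORMALIZE = str.maketrans({"O": "0", "I": "1", "L": "1", "-": None})

def decode_crockford(text: str) -> int:
    """Decode a Crockford base32 string to an integer, tolerating look-alikes."""
    cleaned = text.upper().translate(_NORMALIZE)
    n = 0
    for ch in cleaned:
        # str.index raises ValueError for symbols outside the alphabet, e.g. 'U'.
        n = n * 32 + CROCKFORD.index(ch)
    return n

# "1O", "lo" and "10" all parse to the same value.
assert decode_crockford("1O") == decode_crockford("lo") == decode_crockford("10")
```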
Do we really expect humans to read baseX encodings directly, enough to make ambiguity checks worthwhile?
Sometimes. Imagine if this is being used to generate something like a DOI or other catalog number for some data or physical artifact. As research scales up, the size of these identifiers also benefits from a more compact encoding.

These kinds of IDs might be printed in a research paper (perhaps in a figure caption or bibliography/reference entry). Then, someone might be reading this from a printed copy of the paper rather than a PDF with a link in it.

Or, researchers might be verbally referencing a particular item during some meeting. It might be recognizable among some peers actively working with the same artifacts, but might also need to be typed back into some search form to get back to online metadata etc.

Another place the same identifier might be is on a printed label for physical artifacts in an archive. Of course, you might also want something like a 2D barcode for scanning, but it is helpful to have something human readable.

Removing U just means your CD key begins with FCKGW
So.. crockford32 mentioned in the article?