Hacker News
by sandstrom 938 days ago
Neat library!

We're using randomly generated strings for many things. IDs, password recovery tokens, etc. We've generated millions of them in our system, for various use-cases. Hundreds of thousands of people see them every day.

I've never heard any complaints about a random content-id being "lR8vDick4r" (dick) or whatever.

But nowadays our society is so afraid of offending anyone that profanity filters have extended all the way to database IDs and password recovery tokens.

(there are some legit cases, like randomly generated IDs for user profiles shared in public URLs, that users have to live with, but even there just make the min length 8 and you're unlikely to have any full-word profanity as the complete ID; put differently, I don't understand why they made the block list an opt-out thing)

4 comments

The block list is 2/3 of the (minified) library. I found this entire choice odd.

First, it's highly incomplete because you can find at least 10x more combinations spelling the same "word". And probably 10x more slurs that aren't in this block list. Second, because it's hardcoded in your source. Third, because there are more elegant solutions.

Such as picking an alphabet that can't spell readable words unless you're trying really hard to read a slur into it. Say this (no vowels or digits):

bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ (length 42)

The full lower+upper+digits alphabet they use is 62. Feels like you're losing a lot, but... not really.

- A 128-bit id in base 62 = 22 letters.

- A 128-bit id in base 42 = 24 letters.

JUST TWO MORE LETTERS. And it's one more letter for a 64-bit id (11 vs 12). And we can avoid this entire silliness. The problem is the author doesn't realize that logN is... logarithmic, I suppose.
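
For anyone who wants to check those lengths: a value with `bits` bits needs ceil(bits / log2(base)) symbols in a given base. A quick sketch of that arithmetic (the name `idLength` is made up here for illustration, it isn't from the library):

```javascript
// Symbols needed to represent any `bits`-bit value in the given base.
function idLength(bits, base) {
    return Math.ceil(bits / Math.log2(base));
}

console.log(idLength(128, 62)); // 22
console.log(idLength(128, 42)); // 24
console.log(idLength(64, 62));  // 11
console.log(idLength(64, 42));  // 12
```

Shrinking the alphabet from 62 to 42 symbols only costs about 0.56 bits per character, which is why the ids barely get longer.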

A slight mod: I'd remove Y, despite it not being exactly a vowel, and add back the digits that can't be interpreted as vowels.

bcdfghjklmnpqrstvwxzBCDFGHJKLMNPQRSTVWXZ25679

Gives us base 45. And below is a JS snippet to make an id. There's your lib.

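    function id(num) {
        num = BigInt(num);
        // 20 lowercase + 20 uppercase consonants (no Y) + 5 digits = base 45
        const dict = "bcdfghjklmnpqrstvwxzBCDFGHJKLMNPQRSTVWXZ25679";
        let id = '';
        // Emit base-45 digits, least significant first
        while (num > 0n) {
            id += dict[Number(num % 45n)];
            num /= 45n;
        }
        // Zero produces no digits, so fall back to the zero symbol
        return id || dict[0];
    }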
Example:

    id(123456789012345678901234567890n);

    "bq99hC6fbtjLrkxLPm"

Totally agree. 2/3 is wild, especially given it seems like you could mitigate most of the risk just by removing vowels from the dictionary.

I'm actually more convinced the problem is corporate risk management dealing with social media's tendency to overhype issues by design, rather than a statement about society.

There are other "general" rules when it comes to random human-readable tokens, such as not using Os and Is if your strings include numbers: people can and will confuse them with 0s and 1s if they have to retype them.

Most gift card tokens for example don't allow the use of those two (or quietly correct it) to avoid making that mistake.
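
The "quietly correct it" approach could look roughly like this. It's only a sketch, assuming tokens are uppercase and the alphabet deliberately leaves out the letters O and I; `normalizeToken` is a hypothetical helper, not taken from any real gift card system:

```javascript
// Hypothetical helper: map look-alike letters to the digits they are
// usually mistaken for. Only valid if the token alphabet never contains
// the letters O or I.
function normalizeToken(input) {
    return input
        .trim()
        .toUpperCase()
        .replace(/O/g, "0")
        .replace(/I/g, "1");
}

console.log(normalizeToken(" ab1o-i2 ")); // "AB10-12"
```

Run it on user input before looking the token up, so a mistyped O or I still matches.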

California driver's license and ID numbers are one letter followed by seven digits. But whenever the second character is a zero, everybody reads it as the letter O.

Well, if you don't filter, things like this may happen:

https://github.com/compiler-explorer/compiler-explorer/issue...

You and I will know it's just random characters, but if HR enters the chat...