by ryebit 1871 days ago

In addition to BIP39 cited below, the EFF also published some useful wordlists a few years ago... https://www.eff.org/deeplinks/2016/07/new-wordlists-random-p...

One of them has the nice property that every word has a unique three-letter prefix, but that prefix list is much smaller (1296 words) than their "long" list (7776 words).

That said, I'm kinda partial to BIP39: the first four letters of each word are unique, and the words are more uniform than the EFF prefix list.
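The unique-prefix property is easy to check mechanically. Below is a minimal sketch; the function names and the five sample words are only for illustration, and a real check would load the full wordlist from a file.

```python
from collections import defaultdict


def prefix_collisions(words, n):
    """Group words by their first n letters and return only the groups
    that contain more than one word (i.e. the prefix is not unique)."""
    groups = defaultdict(list)
    for word in words:
        groups[word[:n].lower()].append(word)
    return {prefix: ws for prefix, ws in groups.items() if len(ws) > 1}


def has_unique_prefixes(words, n):
    """True if no two words share the same first n letters."""
    return not prefix_collisions(words, n)


# Illustrative sample only, not a full wordlist.
sample = ["abandon", "ability", "able", "about", "above"]

print(has_unique_prefixes(sample, 4))  # four-letter prefixes are all distinct here
print(prefix_collisions(sample, 3))    # "about" and "above" share "abo"
```

Words shorter than `n` are compared as whole words, which is what you want for a list that mixes three-letter and longer words.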

But it looks like GPS addressing schemes like w3w need a MUCH larger list by an order of magnitude.

1 comments

kybernetikos 1871 days ago

> But it looks like GPS addressing schemes like w3w need a MUCH larger list by an order of magnitude.

If you're trying to get down to 3 words, yes, but if you're happy with 4 words (and I am), then the long list would be more than enough. One problem I found is that people don't want strongly negative phrases being used to describe where they live (rural poverty assault, evil disease island). The BIP39 wordlist can create word groups that are very offensive, or that would feel racist if applied to particular parts of the world. It also has words that are easily confused with other words, like era/error, son/sun, alter/altar, aisle/I'll, floor/flaw.
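On the 3-versus-4-word point, the arithmetic is short enough to sketch. The cell count below is an assumption: roughly the Earth's surface divided into 3 m x 3 m squares, about 57 trillion cells. The function name is made up for this example.

```python
def min_wordlist_size(cells, words_per_address):
    """Smallest list size n such that n ** words_per_address >= cells."""
    # Start from a floating-point estimate, then correct with exact integers.
    n = max(1, round(cells ** (1.0 / words_per_address)))
    while n ** words_per_address < cells:
        n += 1
    while n > 1 and (n - 1) ** words_per_address >= cells:
        n -= 1
    return n


# Assumption: ~5.1e14 m^2 of surface / 9 m^2 per square, rounded.
CELLS = 57 * 10**12

EFF_LONG = 7776  # size of the EFF "long" list mentioned above

print(min_wordlist_size(CELLS, 3))  # tens of thousands of words
print(min_wordlist_size(CELLS, 4))  # a few thousand words
print(EFF_LONG ** 3 >= CELLS)       # False: 3 words from the long list fall short
print(EFF_LONG ** 4 >= CELLS)       # True: 4 words cover it with room to spare
```

Under that assumption, a 3-word scheme needs a list several times larger than the long list, while a 4-word scheme fits comfortably inside it.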

I did use words from BIP39, but had to remove quite a few in the end for my wordlist (e.g. blast, load, black, finger, female etc.) because of the unfortunate clusters it could create.

Ultimately, I think coming up with a good wordlist is still a bit of an unsolved problem. The ideal wordlist for something like this:

1. can't form obviously racist or overtly sexual word clusters

2. can be easily distinguished in spoken communication

3. can be easily distinguished in written communication

4. doesn't have words that sound similar to other words in any of the most common accents

5. doesn't have geographic words (it'd be confusing)

6. is mainly positive or neutral words

7. consists of words that are common and easy to spell, avoiding words that are commonly misspelled or whose standard spelling differs by region

8. has words that are not too long, with a small number of syllables

9. doesn't contain words that are concatenations of other words in the wordlist

Obviously not easy if you need a significant number of them.
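A couple of these criteria can be checked mechanically; the rest need human judgement. Here is a rough sketch covering only point 8 (approximated by letter count rather than syllables) and point 9 (concatenations). The `max_len` threshold, the function names, and the sample words are assumptions for illustration.

```python
def is_concatenation(word, vocab):
    """True if `word` can be split into two or more words from `vocab`."""
    n = len(word)
    # reachable[i] is True when word[:i] can be built from vocab words.
    reachable = [False] * (n + 1)
    reachable[0] = True
    for end in range(1, n + 1):
        for start in range(end):
            whole_word = (start == 0 and end == n)  # skip the trivial "split"
            if reachable[start] and not whole_word and word[start:end] in vocab:
                reachable[end] = True
                break
    return reachable[n]


def mechanical_filter(words, max_len=7):
    """Drop words that are too long or that are built from other list words."""
    vocab = set(words)
    return [
        w for w in words
        if len(w) <= max_len and not is_concatenation(w, vocab)
    ]


candidates = ["air", "port", "airport", "sun", "set", "sunset",
              "lemon", "extraordinary"]
print(mechanical_filter(candidates))
# → ['air', 'port', 'sun', 'set', 'lemon']
```

This version drops the compound and keeps its parts; dropping one of the parts instead would also satisfy point 9 and might keep a more distinctive word.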


jasonwatkinspdx 1871 days ago

I did the same thing as you and had the exact same experience. S2 made the mapping trivial, and I spent nearly all my time on the word list.

I was really surprised to find there's not much out there in the way of cross-language lists of the most commonly used words. I assume such lists are out there somewhere in the computational linguistics community, but I couldn't find them. I ended up using a list of the most common English words, filtered via pairwise Levenshtein distance, and then I did a manual scan to drop any words that seemed problematic.
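For anyone wanting to try the same approach, here is one way a pairwise edit-distance filter could look. This is a sketch, not the code described above: the greedy strategy, the `min_distance` threshold, and the sample words are all assumptions.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def filter_by_distance(words, min_distance=2):
    """Walk the list in order (most common first) and keep a word only if
    it is at least `min_distance` edits from every word kept so far."""
    kept = []
    for word in words:
        if all(levenshtein(word, k) >= min_distance for k in kept):
            kept.append(word)
    return kept


# Assumed to be ordered by frequency, most common first.
ranked = ["son", "sun", "alter", "altar", "floor", "lemon"]
print(filter_by_distance(ranked))
# → ['son', 'alter', 'floor', 'lemon']
```

The pairwise comparison is quadratic in the list size, which is fine for a few thousand words. Edit distance only measures spelling, so sound-alike pairs with quite different spellings would still need a phonetic check or the manual scan.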

It really would be nice if someone would solve this, but I'm not being flippant about just how much effort would be involved.


kybernetikos 1870 days ago

Is there somewhere I can see the wordlist you came up with? My wordlist experiments are mostly here: https://github.com/kybernetikos/wherewords/tree/main/lib/wor...


jasonwatkinspdx 1870 days ago

Sorry, no. I never got around to putting the code up on GitHub, and that laptop died. My list wasn't great anyhow. This is an annoyingly tough problem.
