Hacker News new | ask | show | jobs
by mrbukkake 1664 days ago
Nice idea, naive implementation which leads to the output being unconvincing as hypothetical English words. I had a brief look and it seems to be proportionally selecting and sticking together sequences of letters sampled from English words (lib/word-probability.ts). This doesn't take into account syllable boundaries, the way the English spelling system maps between phones/phonemes and the phonotactic properties of English which is why the output looks unconvincing.

A better approach would be to use a markov chain built from sampling English text letter-by letter... an even better approach would be to build your stats from some source of English words in IPA transcription with syllable boundaries etc marked, then map from IPA to spelling via some kind of lookup table. We use a similar process in reverse in my research group for building datasets for doing Bayesian phylogenies of language families

5 comments

Clearly you are far more of a linguist than I am, but from such a perspective, I had a similar impression; I reloaded the page several times and none of the words struck me as being remotely plausibly English. These are worse than most Hollywood scifi words/names.
A significant improvement on letter-by-letter, but not that much harder, is to use n-grams: "two letters to predict the third" etc. Still not "industry grade", but the results start making more sense.
A letter-by-letter markov chain would lead to similar unconvincing results. As you said, vocal groups matter much more than single letters. If you know anything about korean, they actually group letters into characters that way. If one could build such a markov chain for English it would be very convincing I think.
You're right, I forgot that markov chains are memoryless
I used a letter by letter Markov chain for this: http://password.supply/

The output is definitely not convincing as actual words (but reasonable for somewhat more memorable passwords).

You should check out the VOLT paper, I think it would work well. It's a new technique for splitting up a vocabulary into subwords while minimizing entropy. These subwords could then be mixed and matched, maybe by a neural model, for better results.
Thank you for the reference. To save others a search, I believe this is the paper:

Vocabulary Learning via Optimal Transport for Neural Machine Translation - https://arxiv.org/abs/2012.15671

https://jingjing-nlp.github.io/volt-blog/

https://github.com/Jingjing-NLP/VOLT

I got "minable" on my first try and found it impressive and surprised that it wasn't a word. After 3 other reloads nothing else came up.
Definitely not a fake word. Coal, for instance, is a minable resource.

https://www.dictionary.com/browse/minable

Similarly, ”shitbin” was the second word on my first try, and I had to internet search to convince myself that it isn’t in fact a word.
It definitely is a word, since "mine" is an existing verb.
I got "episexic" and, well, I kind of like that one.