	by mrbukkake 1664 days ago
	Nice idea, but the naive implementation leads to output that is unconvincing as hypothetical English words. I had a brief look, and it seems to be proportionally selecting and sticking together sequences of letters sampled from English words (lib/word-probability.ts). This doesn't take into account syllable boundaries, the way English spelling maps to phones/phonemes, or the phonotactic properties of English, which is why the output looks unconvincing. A better approach would be to use a Markov chain built by sampling English text letter by letter. An even better approach would be to build your stats from some source of English words in IPA transcription with syllable boundaries etc. marked, then map from IPA to spelling via some kind of lookup table. We use a similar process in reverse in my research group to build datasets for Bayesian phylogenies of language families.

5 comments

KennyBlanken 1664 days ago

Clearly you are far more of a linguist than I am, but from such a perspective, I had a similar impression; I reloaded the page several times and none of the words struck me as being remotely plausibly English. These are worse than most Hollywood scifi words/names.


rlayton2 1664 days ago

A significant improvement on letter-by-letter, but not that much harder, is to use n-grams: "two letters to predict the third" etc. Still not "industry grade", but the results start making more sense.
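To make the "two letters to predict the third" idea concrete, here is a minimal sketch of an order-2 character Markov chain. It is not the code from the project being discussed; the word list, the `^`/`$` boundary markers, and the function names `build_model` and `generate` are all assumptions made up for illustration, and a real run would want a full dictionary rather than nine words.

```python
import random
from collections import defaultdict

START = "^"  # hypothetical padding symbol marking the start of a word
END = "$"    # hypothetical symbol marking the end of a word


def build_model(words, order=2):
    """Map each `order`-letter context to the list of letters seen after it."""
    model = defaultdict(list)
    for word in words:
        padded = START * order + word.lower() + END
        for i in range(len(padded) - order):
            context = padded[i:i + order]
            model[context].append(padded[i + order])
    return model


def generate(model, order=2, rng=None, max_len=12):
    """Sample one pseudo-word by walking the chain from the start context."""
    rng = rng or random.Random()
    context = START * order
    letters = []
    while len(letters) < max_len:
        # Duplicates in the list make frequent continuations more likely.
        nxt = rng.choice(model[context])
        if nxt == END:
            break
        letters.append(nxt)
        context = context[1:] + nxt
    return "".join(letters)


# Tiny illustrative word list (an assumption, not real training data).
WORDS = ["mine", "minable", "stable", "table", "cable",
         "minute", "stamina", "candle", "handle"]

model = build_model(WORDS)
rng = random.Random(0)
samples = [generate(model, rng=rng) for _ in range(5)]
print(samples)
```

Every three-letter window in the output was seen somewhere in the training words, which is why the results tend to look more pronounceable than single-letter sampling. Raising `order` makes the output more word-like but also more likely to reproduce real words verbatim.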


bruce343434 1664 days ago

A letter-by-letter Markov chain would lead to similarly unconvincing results. As you said, vocal groups matter much more than single letters. If you know anything about Korean, it actually groups letters into characters that way. If one could build such a Markov chain for English, it would be very convincing, I think.


mrbukkake 1664 days ago

You're right, I forgot that Markov chains are memoryless.


dminor 1664 days ago

I used a letter-by-letter Markov chain for this: http://password.supply/

The output is definitely not convincing as actual words (but reasonable for somewhat more memorable passwords).


rajansaini 1664 days ago

You should check out the VOLT paper, I think it would work well. It's a new technique for splitting up a vocabulary into subwords while minimizing entropy. These subwords could then be mixed and matched, maybe by a neural model, for better results.


lioeters 1664 days ago

Thank you for the reference. To save others a search, I believe this is the paper:

Vocabulary Learning via Optimal Transport for Neural Machine Translation - https://arxiv.org/abs/2012.15671

https://jingjing-nlp.github.io/volt-blog/

https://github.com/Jingjing-NLP/VOLT


themdonuts 1664 days ago

I got "minable" on my first try, found it impressive, and was surprised that it wasn't a word. After 3 more reloads, nothing else like it came up.


tw04 1664 days ago

Definitely not a fake word. Coal, for instance, is a minable resource.

https://www.dictionary.com/browse/minable


phs318u 1664 days ago

Similarly, "shitbin" was the second word on my first try, and I had to search the internet to convince myself that it isn't in fact a word.


thaumasiotes 1664 days ago

It definitely is a word, since "mine" is an existing verb.


Wistar 1663 days ago

I got "episexic" and, well, I kind of like that one.
