Hacker News
by dragontamer 4032 days ago
Why not "Article Adjective Adjective Noun Adverb Verb Article Adjective Adjective Noun"?

If you're making full sentences anyway, the grammar of the sentence doesn't need to change much. The vast majority of the entropy is already in the words themselves.

Example sentence (generated by me, not an RNG): "the tiny hairy fish quickly paints a big scary monster".

EDIT: With 10 words, each from the 252 most common words... sentences of this type would have more than 10^24 possible combinations, or just under 80 bits of entropy (252^10 is roughly 2^79.8). I guess "articles" are pretty much "The" vs "A / An" however, so there really are only 8 words of note...
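
For anyone who wants to check that arithmetic, here is a minimal sketch. The 252-word list size and the two-choice article slots come from the comment above; the function name is made up for illustration, and it assumes every slot is filled uniformly at random.

```python
import math

def sentence_entropy_bits(choices_per_slot):
    """Bits of entropy when each slot is filled independently and uniformly."""
    return sum(math.log2(n) for n in choices_per_slot)

# All 10 slots drawn from a 252-word list.
full = sentence_entropy_bits([252] * 10)

# Same template, but the two article slots only offer 2 choices each.
with_articles = sentence_entropy_bits([2, 252, 252, 252, 252, 252, 2, 252, 252, 252])

print(round(full, 1))           # 79.8
print(round(with_articles, 1))  # 65.8
```

So the two article slots cost about 14 bits compared with the naive 10-word count.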

3 comments

Sorry to double comment, but this is exactly how my hobby project hipku works:

http://hipku.gabrielmartin.net

Well, one, I'd love a more general solution where I could just say "generate a sentence with n bits of entropy" and my algorithm would spin out a sentence of the correct (arbitrary) length. (Hmm... Markov chains?) Or maybe add other mnemonic modifications, like rhymes. And two, I still need an algorithm to conjugate verbs and whatnot, though I suppose that part could just be left to the user. (You get n diceware words — make your own sentence out of them.) But that's boring!
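
One rough way to get "a sentence with n bits of entropy" is to keep appending whole clauses until the target is met. The sketch below is only that, a sketch: the word lists are tiny placeholders, the clause template and function names are invented for illustration, and the verbs are pre-conjugated so no conjugation logic is needed. The joining "and" is fixed, so it adds no entropy.

```python
import math
import secrets

# Tiny placeholder lists; a real generator would need far larger ones.
WORDS = {
    "article": ["the", "a"],
    "adjective": ["tiny", "hairy", "big", "scary"],
    "noun": ["fish", "monster", "dog", "tree"],
    "verb": ["paints", "chases", "finds", "hugs"],
}

# Hypothetical clause template, one part of speech per slot.
CLAUSE = ["article", "adjective", "noun", "verb", "article", "adjective", "noun"]

def clause_bits():
    """Entropy of one clause, assuming uniform independent picks per slot."""
    return sum(math.log2(len(WORDS[pos])) for pos in CLAUSE)

def generate_sentence(min_bits):
    """Return (sentence, bits) with at least min_bits of entropy."""
    n_clauses = max(1, math.ceil(min_bits / clause_bits()))
    clauses = [
        " ".join(secrets.choice(WORDS[pos]) for pos in CLAUSE)
        for _ in range(n_clauses)
    ]
    return " and ".join(clauses), n_clauses * clause_bits()

sentence, bits = generate_sentence(30)
print(sentence)
print(bits)  # 36.0 with these placeholder lists (3 clauses of 12 bits each)
```

With lists this small each clause is only worth 12 bits, which is why the sentences get long quickly; larger lists per part of speech shrink the sentence for the same target.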

With regard to word commonality, I'm pretty sure you could in fact use something like the 5000 most common words. The people who care about this kind of stuff tend to have large vocabularies!

I think Markov chains would be a bad idea for the use case of passwords because some words always follow certain words.

Ah, true. I was thinking more along the lines of "part of speech" Markov chains, if that's even possible. (As in, just an endless stream of "article noun verb adjective noun adverb conjunction adjective noun verb adjective noun conjunction adjective etc." that could then be mad-libbed by diceware.)

It is possible (and, I think, a rather clever idea).

You could, for example, use a part-of-speech tagged corpus (a large collection of text where each word was tagged with its PoS by a grad student). Just train a Markov model on the parts of speech instead of the words themselves, and you would be able to generate English-like mad-libs.
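
A minimal sketch of that idea, under some loud assumptions: the two hand-tagged sentences below stand in for a real tagged corpus, the tag names and function names are made up, and the fill-in words are taken from the same toy corpus rather than from proper per-part-of-speech diceware lists.

```python
import secrets
from collections import defaultdict

# Stand-in for a PoS-tagged corpus: each sentence is a list of (word, tag) pairs.
TAGGED = [
    [("the", "DET"), ("tiny", "ADJ"), ("fish", "NOUN"),
     ("paints", "VERB"), ("a", "DET"), ("monster", "NOUN")],
    [("a", "DET"), ("dog", "NOUN"), ("quickly", "ADV"), ("chases", "VERB"),
     ("the", "DET"), ("big", "ADJ"), ("tree", "NOUN")],
]

def train(tagged_sentences):
    """Build tag-to-next-tag transitions and a per-tag vocabulary."""
    transitions = defaultdict(list)
    vocab = defaultdict(set)
    for sent in tagged_sentences:
        tags = ["<s>"] + [tag for _, tag in sent] + ["</s>"]
        for cur, nxt in zip(tags, tags[1:]):
            transitions[cur].append(nxt)  # repeats act as frequency weights
        for word, tag in sent:
            vocab[tag].add(word)
    return transitions, {tag: sorted(ws) for tag, ws in vocab.items()}

def generate_tags(transitions, max_len=20):
    """Walk the tag chain from <s> until </s> or max_len tags."""
    tags, cur = [], "<s>"
    while len(tags) < max_len:
        cur = secrets.choice(transitions[cur])
        if cur == "</s>":
            break
        tags.append(cur)
    return tags

def mad_lib(tags, vocab):
    """Fill each tag slot with a uniformly random word of that tag."""
    return " ".join(secrets.choice(vocab[tag]) for tag in tags)

transitions, vocab = train(TAGGED)
tags = generate_tags(transitions)
print(tags)
print(mad_lib(tags, vocab))
```

Only the tags go through the Markov model, so no word is ever predicted from the word before it. If the walk hits `max_len` it stops mid-sentence, which a real version would probably want to handle more gracefully.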

To make it less obvious, have a few other kinds of sentence. But make those kinds clearly differentiated by length.