| HN Mirror

	by cowsaymoo 787 days ago
	I transcend this problem by making all my database passwords 'abcd'

3 comments

kgeist 787 days ago

The tool found "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890" in our codebase as a high entropy line :)

josephg 787 days ago

You can use LLMs as compressors, and I wonder how it would go with that.

The approach is simple: Turn the file into a stream of tokens. For each token, ask a language model to generate the full set of predictions based on context, and sort based on likelihood. Look where the actual token appears in the sorted list. Low entropy symbols will be near the start of the list, and high entropy tokens near the end.

I suspect most language models would deal with your alphabet example just fine, while still correctly spotting passwords and API keys. It would be a fun experiment to try!
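The ranking idea above can be sketched without an LLM. The block below swaps in a toy character-level bigram model as a stand-in predictor, which is an assumption made purely for illustration; the names `ALPHABET`, `train_bigram`, `prediction_ranks` and `mean_rank` are hypothetical, and the training sentence is arbitrary. It only shows the mechanics: sort the predictions, then look up where the actual symbol lands.

```python
import string
from collections import Counter, defaultdict

# Characters the toy model is allowed to predict (an assumption of this sketch).
ALPHABET = string.ascii_letters + string.digits + string.punctuation + " "

def train_bigram(corpus):
    """Count, for each character, which characters follow it in the corpus."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1
    return counts

def prediction_ranks(model, text):
    """Rank of each actual character in the model's sorted prediction list.

    Assumes every character of text is in ALPHABET.
    """
    ranks = []
    for prev, actual in zip(text, text[1:]):
        seen = model.get(prev, Counter())
        # Most frequent successor first; sorted() is stable, so ties and
        # unseen characters keep ALPHABET order.
        ordered = sorted(ALPHABET, key=lambda c: -seen[c])
        ranks.append(ordered.index(actual))
    return ranks

def mean_rank(model, text):
    """Average rank; assumes text has at least two characters."""
    ranks = prediction_ranks(model, text)
    return sum(ranks) / len(ranks)

model = train_bigram("the quick brown fox jumps over the lazy dog " * 3)
print(mean_rank(model, "the lazy dog jumps"))  # text the model has seen
print(mean_rank(model, "x9Qz7Lk2Vb"))          # key-like string
```

A bigram counter is far weaker than a language model, so this says nothing about how a real LLM would score the alphabet line; it only illustrates that low ranks mean "predictable" and high ranks mean "surprising".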

g15jv2dp 787 days ago

Well, it is...

saurik 787 days ago

I mean, it certainly has a low Kolmogorov complexity (which is what I would really want to be measuring somehow for this tool... note that I am not claiming that is possible: just an ideal); I am unsure how that affects the related bounds on Shannon entropy, though.
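A small sketch of the gap between the two measures, assuming the tool scores a line by the Shannon entropy of its character frequencies (the thread does not confirm how it actually works). The function name `shannon_bits_per_char` is hypothetical. Every one of the 62 characters in the alphabet line occurs exactly once, so the frequency-based entropy is as high as it can be for that length, yet a short expression regenerates the whole string.

```python
import math
from collections import Counter

def shannon_bits_per_char(s):
    """Shannon entropy of the character frequency distribution of s."""
    n = len(s)
    return -sum((k / n) * math.log2(k / n) for k in Counter(s).values())

line = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890"

# A short expression that regenerates the same 62 characters, which is the
# informal sense in which the line has low Kolmogorov complexity.
generated = (
    "".join(chr(c) for c in range(ord("A"), ord("Z") + 1))
    + "".join(chr(c) for c in range(ord("a"), ord("z") + 1))
    + "1234567890"
)

print(generated == line)
print(shannon_bits_per_char(line), math.log2(62))
```

The frequency count ignores ordering entirely, which is why a perfectly regular sequence can still look maximally random to it.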

ngneer 787 days ago

Then use it as your password ;)

jraph 787 days ago

…a very verbose way to match alphanumeric characters :-)

nvy 787 days ago

Username: postgres

Password: postgres

randomtoast 787 days ago

Reminds me of https://xkcd.com/936/ I think "correct horse battery staple" has a low entropy, since it is just ordinary looking words (strings).

josephg 787 days ago

A quick Google search suggests English has about 10 bits of entropy per word. Having a long password like that can still have high total entropy I suppose, but it has a low entropy density.
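The "high total, low density" point can be put into rough numbers. This is back-of-the-envelope arithmetic that takes the 10 bits per word figure above at face value and compares it with a uniformly random character from a 62-symbol alphanumeric set; the variable names are illustrative only.

```python
import math

BITS_PER_WORD = 10  # estimate quoted above, not a measured value

passphrase = "correct horse battery staple"
words = passphrase.split()

total_bits = BITS_PER_WORD * len(words)   # entropy of the whole phrase
density = total_bits / len(passphrase)    # bits per typed character

# A uniformly random alphanumeric character carries log2(62) bits.
alnum_bits_per_char = math.log2(62)

# Length of a random alphanumeric string with at least the same total entropy.
equivalent_length = math.ceil(total_bits / alnum_bits_per_char)

print(total_bits, round(density, 2), round(alnum_bits_per_char, 2), equivalent_length)
```

Under these assumptions the 28-character phrase carries about as much entropy as a much shorter random alphanumeric string, which is the sense in which its density is low.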

kqr 786 days ago

Maybe 10 bits is the average over the dictionary, which is what matters here, but over normal text it is significantly less. Our best current estimate for relatively high-level text (texts published by the EU) is 6 bits per word[1].

However, as our methods of predicting text improve, this number is revised down. LLMs ought to have made a serious dent in it, but I haven't looked up any newer results.

Anyway, all of this is to say that which words are chosen matters, but how they are put together matters perhaps more.

[1]: http://arxiv.org/pdf/1606.06996

soraminazuki 785 days ago

The diceware method is supposed to generate totally random words, so it should fundamentally be unpredictable unless there's a flaw in the source of randomness.
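A minimal sketch of that idea: when each word is drawn independently and uniformly, the entropy depends only on the list size and the word count, not on how ordinary the words look. The helper names and the tiny `demo_words` list are hypothetical; the original Diceware list has 6^5 = 7776 entries, and Python's `secrets` module stands in for the dice.

```python
import math
import secrets

def passphrase_bits(list_size, n_words):
    """Entropy in bits of n_words drawn uniformly and independently."""
    return n_words * math.log2(list_size)

def random_passphrase(wordlist, n_words):
    """Pick words with a cryptographically secure random source."""
    return " ".join(secrets.choice(wordlist) for _ in range(n_words))

# Tiny placeholder list for illustration only; far too small for real use.
demo_words = ["correct", "horse", "battery", "staple", "cloud", "river", "amber", "pixel"]

print(random_passphrase(demo_words, 4))
print(passphrase_bits(len(demo_words), 4))  # entropy with the 8-word demo list
print(passphrase_bits(7776, 4))             # entropy with a 7776-word list
```

The estimate holds only if the words really are chosen by the random source; a phrase picked by a person from the same list would not carry this much entropy.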
