| HN Mirror



	by kgeist 787 days ago
	The tool found "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890" in our codebase as a high entropy line :)
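A minimal sketch of why a line like this would get flagged, assuming the tool scores lines by per-character Shannon entropy (the tool's actual metric isn't stated here, and `shannon_entropy` is a hypothetical helper name). Every one of the 62 characters appears exactly once, so a frequency-based measure sees the maximum possible entropy for a string of that length:

```python
import math
from collections import Counter

def shannon_entropy(s):
    """Bits per character, estimated from character frequencies in s alone."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

line = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890"

# 62 distinct characters, each seen once, so this equals log2(62).
print(round(shannon_entropy(line), 3))  # 5.954
```

A frequency count has no notion of ordering, so it can't tell this apart from 62 characters shuffled at random.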

2 comments

josephg 787 days ago

You can use LLMs as compressors, and I wonder how it would go with that.

The approach is simple: turn the file into a stream of tokens. For each token, ask a language model to generate the full set of predictions based on the preceding context, and sort them by likelihood. Then look at where the actual token appears in the sorted list. Low-entropy tokens will be near the start of the list, and high-entropy tokens near the end.

I suspect most language models would deal with your alphabet example just fine, while still correctly spotting passwords and API keys. It would be a fun experiment to try!
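Here's a rough sketch of the ranking idea, with a toy character-bigram counter standing in for the language model (an assumption made so the example runs without any model download; `train_bigram` and `ranks` are made-up names). A real experiment would swap in an LLM's sorted next-token predictions:

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each character, which characters follow it."""
    follows = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        follows[prev][nxt] += 1
    return follows

def ranks(model, alphabet, line):
    """Position of each actual character in the model's sorted predictions."""
    out = []
    for prev, actual in zip(line, line[1:]):
        counts = model[prev]
        # Most likely first; ties broken alphabetically to stay deterministic.
        ordered = sorted(alphabet, key=lambda c: (-counts[c], c))
        out.append(ordered.index(actual))
    return out

# Toy "training data": the model has only ever seen the alphabet in order.
corpus = "abcdefghijklmnopqrstuvwxyz" * 3
alphabet = sorted(set(corpus))
model = train_bigram(corpus)

print(ranks(model, alphabet, "abcdef"))  # predictable: every rank is 0
print(ranks(model, alphabet, "qzxkvb"))  # unpredictable: ranks are large
```

Averaging the ranks per line (or summing the negative log-probabilities, if the model exposes them) would give a per-line score to threshold on.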


g15jv2dp 787 days ago

Well, it is...


saurik 787 days ago

I mean, it certainly has a low Kolmogorov complexity (which is what I would really want to be measuring somehow for this tool... note that I am not claiming that is possible: just an ideal); I am unsure how that affects the related bounds on Shannon entropy, though.
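Kolmogorov complexity itself isn't computable, but a small sketch can show how much the Shannon estimate depends on the model you pick. The delta encoding below is my own illustrative choice, not something the tool is known to do, and `entropy_bits` is a hypothetical helper. Measured over raw characters the string looks maximally random; measured over differences between neighbouring code points it is almost all the same symbol:

```python
import math
from collections import Counter

def entropy_bits(symbols):
    """Bits per symbol, estimated from symbol frequencies."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

line = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890"

# Re-express the string as steps between adjacent code points.
deltas = [ord(b) - ord(a) for a, b in zip(line, line[1:])]

print(entropy_bits(line))    # high: every character is distinct
print(entropy_bits(deltas))  # low: 58 of the 61 steps are +1
```

So the short description ("start at A, keep adding one") shows up as low entropy only once the model is able to express it.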


ngneer 787 days ago

Then use it as your password ;)


jraph 787 days ago

…a very verbose way to match alphanumeric characters :-)
