Hacker News
by kgeist 741 days ago
The tool found "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890" in our codebase as a high entropy line :)
2 comments

You can use LLMs as compressors, and I wonder how it would go with that.

The approach is simple: turn the file into a stream of tokens. For each token, ask a language model for its full set of next-token predictions given the preceding context, and sort them by likelihood. Then look at where the actual token appears in the sorted list. Low-entropy tokens will be near the start of the list, and high-entropy tokens near the end.
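To make the ranking idea concrete, here is a minimal sketch. It is not a real LLM: a character-level bigram counter stands in for the language model, and characters stand in for tokens. The names `train_bigram` and `token_ranks` are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count which character follows which (toy stand-in for a language model)."""
    follows = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        follows[prev][nxt] += 1
    return follows

def token_ranks(text, follows, vocab):
    """For each character after the first, return its position in the model's
    predictions sorted from most to least likely (0 = top prediction)."""
    ranks = []
    for prev, actual in zip(text, text[1:]):
        counts = follows.get(prev, Counter())
        # Sort the whole vocabulary by descending count; ties are broken
        # alphabetically so the ordering is deterministic.
        ordered = sorted(vocab, key=lambda ch: (-counts[ch], ch))
        ranks.append(ordered.index(actual))
    return ranks

# Assumed toy training data: the lowercase alphabet repeated a few times.
corpus = "abcdefghijklmnopqrstuvwxyz" * 3
vocab = sorted(set(corpus))
model = train_bigram(corpus)

predictable = token_ranks("abcdef", model, vocab)  # follows the learned pattern
surprising = token_ranks("qzjxkv", model, vocab)   # does not
print(predictable, surprising)
```

With a real model you would swap the bigram counts for the model's next-token probabilities and flag lines whose average rank is high. Any threshold would have to be tuned by experiment.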

I suspect most language models would deal with your alphabet example just fine, while still correctly spotting passwords and API keys. It would be a fun experiment to try!

Well, it is...
I mean, it certainly has a low Kolmogorov complexity (which is what I would really want to be measuring somehow for this tool... note that I am not claiming that is possible: just an ideal); I am unsure how that affects the related bounds on Shannon entropy, though.
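A quick sketch of the gap between the two measures, assuming the tool scores something like per-character frequency entropy (the thread does not say how it actually works). `char_entropy` is a hypothetical helper, not the tool's code.

```python
import math
import string
from collections import Counter

def char_entropy(s):
    """Empirical Shannon entropy in bits per character, using only the
    character frequencies within s itself."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890"

# All 62 characters are distinct, so the frequency-based estimate is as high
# as it can be for a string of this length: log2(62) bits per character.
print(char_entropy(ALPHABET), math.log2(62))

# Yet a very short expression regenerates the string exactly, which is the
# informal sense in which its Kolmogorov complexity is low.
generated = string.ascii_uppercase + string.ascii_lowercase + string.digits[1:] + "0"
print(generated == ALPHABET)
```

So a frequency-only measure sees the string as maximally random, while a description-length view sees it as trivial. Kolmogorov complexity itself is not computable, so any practical tool can only approximate it.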
Then use it as your password ;)
…a very verbose way to match alphanumeric characters :-)