Hacker News
by thomasluce 1841 days ago
I worked for an internet scraping and statistics-gathering company some years ago, and we used this approach alongside a few others to find mailing addresses embedded in websites. Basically, you use LZW-type compression whose dictionary (the entropy information) is trained only on known addresses, then compress a document and look for the section of the document with the highest compression ratio.

It worked decently well, and surprisingly better than a lot of other, more standard approaches just because of the wild non-uniformity of human-generated content on the web.
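A minimal sketch of the idea in Python, not the original code. Everything here is an assumption made for illustration: the function names, the fixed-size sliding window, the characters-per-code score, and the made-up sample addresses. It grows an LZW-style phrase table over known addresses, freezes it, and then scores each window of a page by how many characters each emitted code covers.

```python
def build_table(samples):
    """Grow an LZW-style phrase table over known addresses, then freeze it.

    Single characters are treated as always encodable, so only phrases of
    two or more characters are stored.
    """
    table = set()
    for s in samples:
        w = ""
        for ch in s:
            if not w or w + ch in table:
                w += ch                  # keep extending the current phrase
            else:
                table.add(w + ch)        # learn one new, longer phrase
                w = ch
    return frozenset(table)


def count_codes(text, table):
    """Count the codes a greedy encoder emits using the frozen table."""
    codes, w = 0, ""
    for ch in text:
        if w and w + ch not in table:
            codes += 1                   # emit the code for w, start over
            w = ch
        else:
            w += ch
    return codes + (1 if w else 0)


def best_window(doc, table, size=40, step=5):
    """Return (ratio, start, chunk) for the most compressible window."""
    best = (0.0, 0, "")
    for start in range(0, max(1, len(doc) - size + 1), step):
        chunk = doc[start:start + size]
        codes = count_codes(chunk, table)
        if codes == 0:
            continue                     # empty chunk, nothing to score
        ratio = len(chunk) / codes       # characters covered per code
        if ratio > best[0]:
            best = (ratio, start, chunk)
    return best


# Made-up addresses standing in for a large hand-picked training set.
# The tiny list is repeated so the phrases have a chance to grow.
KNOWN = [
    "12 Elm St, Troy, NY 12180",
    "48 Oak Ave, Albany, NY 12203",
    "7 Pine Rd, Utica, NY 13501",
] * 4

page = ("Welcome to our shop! We sell quilts and jam. "
        "Find us at 30 Elm St, Albany, NY 12203 every weekend.")

table = build_table(KNOWN)
ratio, start, chunk = best_window(page, table)
print(f"{ratio:.2f} chars/code at offset {start}: {chunk!r}")
```

With a training set this small the table stays shallow, so the scores only separate address-like text from ordinary prose weakly. A real table would be built from far more addresses, and the window size and step would need tuning for the pages being scanned.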

1 comments

Does that mean you were doing an LZW compression but with a fixed table?
Yes, exactly. We pre-built the table from a ton of hand-picked mailing addresses copy-pasted out of a bunch of free text, and then just kept using that one.
Great, I need to try something like that.