HN Mirror

	by thomasluce 1841 days ago
	I worked for an internet scraping/statistics-gathering company some years ago, and we used this approach alongside a few others to find mailing addresses embedded in websites. Basically, you use LZW-type compression with entropy information trained only on known addresses, then compress a document and look for the section with the highest compression ratio. It worked decently well, and surprisingly better than a lot of other, more standard approaches, mostly because of the wild non-uniformity of human-generated content on the web.

1 comment

ta988 1841 days ago

Does that mean you were doing an LZW compression but with a fixed table?

thomasluce 1841 days ago

Yes, exactly. We pre-built the table with a ton of hand-picked mailing addresses copy-pasted out of a bunch of free-text and then just kept using that one.
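
For illustration, here is a rough Python sketch of the idea as described in this thread. It is not the original code. It assumes the table is grown by ordinary LZW phrase learning over the sample addresses and then frozen, that "compression ratio" can be approximated as input characters per emitted code, and that candidate sections are fixed-width sliding windows. The function names, the `passes`, `width`, and `step` parameters, and the sample addresses are all made up for the example.

```python
def build_table(samples, passes=3):
    """Grow an LZW phrase table from known addresses, then freeze it."""
    table = {chr(i) for i in range(256)}  # start from single characters
    for _ in range(passes):  # extra passes let longer phrases form
        for sample in samples:
            w = ""
            for ch in sample:
                if w + ch in table:
                    w += ch
                else:
                    table.add(w + ch)  # standard LZW step: learn the new phrase
                    w = ch
    return frozenset(table)


def count_codes(text, table):
    """Greedy LZW parse against the frozen table; nothing new is learned."""
    codes, w = 0, ""
    for ch in text:
        if w + ch in table:
            w += ch
        else:
            if w:
                codes += 1
            w = ch  # a character missing from the table costs one code by itself
    return codes + (1 if w else 0)


def chars_per_code(text, table):
    """Compression-ratio proxy: input characters per emitted code."""
    codes = count_codes(text, table)
    return len(text) / codes if codes else 0.0


def best_window(document, table, width=40, step=5):
    """Slide a fixed-width window; return (ratio, start, chunk) for the best one."""
    best = (0.0, 0, "")
    last_start = max(len(document) - width, 0)
    for start in range(0, last_start + 1, step):
        chunk = document[start:start + width]
        ratio = chars_per_code(chunk, table)
        if ratio > best[0]:
            best = (ratio, start, chunk)
    return best


# Invented sample data, only to show how the pieces fit together.
known = [
    "123 Main Street, Springfield, IL 62701",
    "456 Oak Avenue, Portland, OR 97205",
]
page = "Thanks for visiting! Write to us at 789 Main Street, Portland, OR 97205 or call."
table = build_table(known)
print(best_window(page, table))
```

Text that looks like the training addresses parses into a few long phrases, so it scores well above 1 character per code, while unrelated text falls back to mostly single-character codes. A real system would presumably need far more training samples and some handling of variable address lengths.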

ta988 1840 days ago

Great, I need to try something like that.