by thomasluce
1841 days ago
I worked for an internet scraping and statistics-gathering company some years ago, and we used this approach alongside a few others to find mailing addresses embedded in websites. Basically, you take an LZW-type compressor whose dictionary (its entropy information) is trained only on known addresses, run it over a document, and look for the section of the document with the highest compression ratio. It worked decently well, and surprisingly better than a lot of other, more standard approaches, simply because of the wild non-uniformity of human-generated content on the web.
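
For anyone who wants to play with the idea, here is a rough sketch of how it could look in Python. To be clear, this is not the code we ran: it swaps LZW for zlib's DEFLATE with a preset dictionary (`zdict`), which is the closest thing in the standard library to "a compressor trained only on known addresses". The names `compressed_size` and `best_window`, and the `window` and `step` parameters, are made up for illustration.

```python
import zlib


def compressed_size(chunk: bytes, zdict: bytes) -> int:
    """Size of `chunk` after DEFLATE compression primed with `zdict`."""
    # The preset dictionary plays the role of the "trained" model:
    # substrings that already occur in it are encoded as short back-references.
    comp = zlib.compressobj(level=9, zdict=zdict)
    return len(comp.compress(chunk) + comp.flush())


def best_window(document: str, known_addresses: list[str],
                window: int = 40, step: int = 1):
    """Return (ratio, start_offset, text) for the best-compressing window.

    Assumption: a fixed-size sliding window is good enough for a demo.
    A real system would need variable-length spans and a threshold.
    """
    zdict = "\n".join(known_addresses).encode("utf-8")
    data = document.encode("utf-8")
    best = None
    last_start = max(0, len(data) - window)
    for start in range(0, last_start + 1, step):
        chunk = data[start:start + window]
        # Higher ratio means the chunk looks more like the training data.
        ratio = len(chunk) / compressed_size(chunk, zdict)
        if best is None or ratio > best[0]:
            best = (ratio, start, chunk.decode("utf-8", errors="replace"))
    return best


if __name__ == "__main__":
    # Toy training set; a real one would hold many thousands of addresses.
    addresses = [
        "1600 Pennsylvania Avenue NW, Washington, DC 20500",
        "350 Fifth Avenue, New York, NY 10118",
    ]
    page = ("Thanks for visiting our site! Questions? Write to us at "
            "1600 Pennsylvania Avenue NW, Washington, DC 20500 "
            "and we will get back to you as soon as we can.")
    print(best_window(page, addresses))
```

The zlib header and checksum add the same fixed overhead to every window, so the ratios are only meaningful relative to each other. The toy page also contains an address that is in the training set verbatim, which makes things easy; in practice the interesting part is how well shared pieces like street suffixes, state codes, and ZIP patterns carry over to addresses the dictionary has never seen.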