by kqr 78 days ago
I have a hypothesis that email scrapers don't parse HTML at all. I suspect they search the raw bytestring for @ characters and take whatever's on either side of each one. That probably gets them as many addresses as they can realistically use at a fraction of the cost, given how expensive HTML parsing can be.

(Similarly, I'm sure most links can be found by searching the bytestring for "href" and taking what's to the right of it.)
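To make the idea concrete, here is a minimal Python sketch of what such a scan could look like. It is a guess, not code from any real scraper: the names `scan_addresses`, `scan_links`, and `ADDRESS_BYTES` are hypothetical, and the set of bytes treated as part of an address is my own assumption.

```python
import re

# Assumption: bytes considered part of an address; anything else ends the token.
ADDRESS_BYTES = frozenset(
    b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.-_+"
)

# Assumption: a link is whatever follows "href=" up to a quote, space, or ">".
HREF = re.compile(rb"""href\s*=\s*["']?([^"'\s>]+)""")


def scan_addresses(raw: bytes) -> list[bytes]:
    """Find each @ byte and widen the match to both sides, with no HTML parsing."""
    found = []
    pos = raw.find(b"@")
    while pos != -1:
        start = pos
        while start > 0 and raw[start - 1] in ADDRESS_BYTES:
            start -= 1
        end = pos + 1
        while end < len(raw) and raw[end] in ADDRESS_BYTES:
            end += 1
        # Drop a trailing "." or "-" picked up from sentence punctuation.
        while end > pos + 1 and raw[end - 1] in b".-":
            end -= 1
        local, domain = raw[start:pos], raw[pos + 1:end]
        if local and b"." in domain:
            found.append(raw[start:end])
        pos = raw.find(b"@", pos + 1)
    return found


def scan_links(raw: bytes) -> list[bytes]:
    """Search for "href" and take what is to the right of it."""
    return HREF.findall(raw)
```

Both functions only ever look at the bytes, so malformed markup cannot trip them up, which fits the "avoid parser failure modes" theory as well as the cost one.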

This would explain why HTML entities are so effective.
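A quick way to see why, as a sketch under the assumption that the scraper behaves like the naive scan described above (the `NAIVE` pattern and `naive_scan` are hypothetical names, not taken from any real tool): once the @ is written as an entity, there is no @ byte left to find unless the scraper first decodes entities.

```python
import html
import re

# Assumption: a simple byte-level address pattern standing in for the scraper.
NAIVE = re.compile(rb"[A-Za-z0-9._+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def naive_scan(raw: bytes) -> list[bytes]:
    """Return everything in the raw bytes that looks like an address."""
    return NAIVE.findall(raw)


plain = b"<p>Contact: kqr@example.com</p>"
obfuscated = b"<p>Contact: kqr&#64;example&#46;com</p>"

print(naive_scan(plain))       # the address is found
print(naive_scan(obfuscated))  # nothing is found: there is no @ byte

# A scraper willing to pay for one decoding pass recovers the address.
decoded = html.unescape(obfuscated.decode("utf-8")).encode("utf-8")
print(naive_scan(decoded))
```

A browser renders both versions identically, so the obfuscation costs human readers nothing.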

On the other hand, surely the TLS handshake is far more expensive than HTML parsing? Maybe it's to avoid parser failure modes that consume a lot of resources?

4 comments

> This would explain why HTML entities are so effective.

Could also be that they learned that sending spam to obfuscated addresses doesn’t get much response. Such messages might get filtered out more, and/or addressees might be less inclined to reply to them.

It really varies. You are correct that most modern ones search the byte string for @ characters, but there are probably hundreds of different methods out there in black hat marketing circles for scraping emails.

Haven’t heard “black hat marketing” before, but that’s very fitting for a lot of the “growth hackers” out there.

I believe you’re right. But sometimes, you really have to think about how mad your adversary is.

A dog will keep biting long after that has become a disastrous plan.

Token-based extraction around the @ is definitely one way that can work with a few tweaks.
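Roughly what that could look like, as a hedged Python sketch. The tweaks here (stripping sentence punctuation, dropping a `mailto:` prefix, requiring a dot in the domain) are my own assumptions about what "a few tweaks" might mean, and `extract_by_tokens` and `TOKEN_SPLIT` are hypothetical names.

```python
import re

# Assumption: tokens are separated by whitespace, quotes, brackets, commas, semicolons.
TOKEN_SPLIT = re.compile(r"[\s<>\"'()\[\],;]+")


def extract_by_tokens(text: str) -> list[str]:
    """Split the text into tokens and keep the ones shaped like an address."""
    results = []
    for token in TOKEN_SPLIT.split(text):
        if token.count("@") != 1:
            continue
        # Tweak 1: drop punctuation that clings to the token in running text.
        token = token.strip(".:!?")
        # Tweak 2: drop a mailto: prefix left over from an href value.
        if token.lower().startswith("mailto:"):
            token = token[len("mailto:"):]
        # Tweak 3: require a non-empty local part and a dotted domain.
        local, _, domain = token.partition("@")
        if local and "." in domain and not domain.startswith("."):
            results.append(token.lower())
    return results
```

Bare mentions such as `@home` or `a@b` are rejected by the third tweak, which is where most of the false positives from a pure @ search would otherwise come from.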