HN Mirror

by kqr 78 days ago

I have a hypothesis that email scrapers don't parse HTML at all. I suspect they search the raw bytestring for @ characters and take whatever's on either side of each one. That probably gets them as many addresses as they can realistically use at a fraction of the cost, given how expensive HTML parsing can be.
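
A minimal sketch of what such a scan might look like, assuming the scraper simply walks outward from each @ byte. The function name and the two character sets are illustrative assumptions, not taken from any real scraper:

```python
# Hypothetical sketch: find email-like strings in raw bytes without parsing HTML.
# The allowed character sets are a simplification chosen for illustration.
LOCAL_CHARS = frozenset(
    b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789._%+-"
)
DOMAIN_CHARS = frozenset(
    b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.-"
)


def scan_for_addresses(raw: bytes) -> list[str]:
    """Return address-like strings found around each '@' byte."""
    found = []
    pos = raw.find(b"@")
    while pos != -1:
        # Walk left while the bytes look like a local part.
        start = pos
        while start > 0 and raw[start - 1] in LOCAL_CHARS:
            start -= 1
        # Walk right while the bytes look like a domain.
        end = pos + 1
        while end < len(raw) and raw[end] in DOMAIN_CHARS:
            end += 1
        local = raw[start:pos]
        # Drop trailing punctuation such as a sentence-ending period.
        domain = raw[pos + 1:end].rstrip(b".-")
        if local and b"." in domain:
            found.append((local + b"@" + domain).decode("ascii"))
        pos = raw.find(b"@", pos + 1)
    return found


page = b'<p>Contact <a href="mailto:alice@example.com">Alice</a> or bob@example.org.</p>'
print(scan_for_addresses(page))
```

The markup around each address is never interpreted; the tags and quotes only act as delimiters because they fall outside the allowed character sets.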

(Similarly, I'm sure most links can be found by searching the bytestring for "href" and taking what's to the right of it.)
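
The same idea applied to links might look like the sketch below. It assumes lowercase `href` and handles quoted and unquoted values only; `scan_for_links` is a hypothetical name, and a real crawler would likely need more care:

```python
# Hypothetical sketch: pull link targets out of raw bytes by searching for "href".
WHITESPACE = b" \t\r\n"


def scan_for_links(raw: bytes) -> list[bytes]:
    """Return whatever follows each 'href=' in the raw bytes."""
    links = []
    pos = raw.find(b"href")
    while pos != -1:
        i = pos + len(b"href")
        # Skip optional whitespace before the '='.
        while i < len(raw) and raw[i] in WHITESPACE:
            i += 1
        if i < len(raw) and raw[i] == ord("="):
            i += 1
            # Skip optional whitespace after the '='.
            while i < len(raw) and raw[i] in WHITESPACE:
                i += 1
            if i < len(raw) and raw[i] in b"\"'":
                # Quoted value: read up to the matching quote.
                end = raw.find(bytes([raw[i]]), i + 1)
                if end != -1:
                    links.append(raw[i + 1:end])
            else:
                # Unquoted value: read up to whitespace or '>'.
                end = i
                while end < len(raw) and raw[end] not in b" \t\r\n>":
                    end += 1
                if end > i:
                    links.append(raw[i:end])
        pos = raw.find(b"href", pos + len(b"href"))
    return links


page = b'<a href="https://example.com/a">A</a> <a href=\'/b\'>B</a> <a href=/c>C</a>'
print(scan_for_links(page))
```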

This would explain why HTML entities are so effective.
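
To illustrate why entities would defeat a raw byte scan: an address written with numeric character references contains no @ byte at all until something decodes it. The sample markup and helper names below are made up for the example; `html.unescape` is from the Python standard library:

```python
import html

# Hypothetical page fragment with '@' and '.' written as numeric entities.
obfuscated = b'<a href="mailto:alice&#64;example&#46;com">mail me</a>'


def contains_at_byte(raw: bytes) -> bool:
    """What a raw byte scanner sees: is there any '@' to anchor on?"""
    return b"@" in raw


def decode_entities(raw: bytes) -> str:
    """What a browser (or an HTML-aware scraper) sees after decoding entities."""
    return html.unescape(raw.decode("utf-8"))


print(contains_at_byte(obfuscated))   # False
print(decode_entities(obfuscated))    # the address now contains a literal '@'
```

A scraper would have to pay for at least an entity-decoding pass before its @ search could find this address.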

On the other hand, surely the TLS handshake is far more expensive than HTML parsing? Maybe it's to avoid parser failure modes that consume a lot of resources?

4 comments

Someone 78 days ago

> This would explain why HTML entities are so effective.

Could also be that they learned that sending spam to obfuscated addresses doesn’t get much response. Such messages might get filtered out more, and/or addressees might be less inclined to reply to them.

BorisMelnik 78 days ago

It really varies. You are correct that most modern ones search the byte string for @ characters, but there are probably hundreds of different methods out there in black hat marketing circles for scraping emails.

mcmcmc 78 days ago

Haven’t heard “black hat marketing” before but that’s very fitting for a lot of the “growth hackers” out there

curiousObject 78 days ago

I believe you’re right. But sometimes, you really have to think about how mad your adversary is.

A dog will keep biting long after that is a disastrous plan.

j45 78 days ago

Token-based extraction around the @ is definitely one approach that can work with a few tweaks.
