by nolok 766 days ago
> It's a good first line of defense as a lot of scraping techniques do this approach.

Most basic scrapers, the ones not built on testing, devtools, or automation tooling, actually work on the raw text without any interpretation. They grep the source code; they don't run a DOM and a JavaScript engine, because that makes a major difference in computing needs and speed.

I am not saying there are no evil scrapers doing DOM evaluation; there are tons. I am reacting to your "FIRST line of defense": the first line is scrambling the raw text, which is how we got here.

What the parent is saying is that this tries to upgrade the defense we built to stop the threat as it evolved, but it forgets why we got here and thus makes itself vulnerable to the original threat.

2 comments

Absolutely. The basic tools just fetch sites recursively and use regular expressions. The advanced tools are Chromium-based, so will render SVGs just fine (and then potentially run OCR / AI to extract text even from JPEGs).

This technique protects against a "neither here nor there" subset of programs. I wonder how large that set is in practice.
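
To make the "basic tools" case concrete, here is a rough sketch of a regex-only scraper run against raw page source. The sample pages, the `entity_encode` helper, and the choice of numeric HTML entities as the scrambling are all my own assumptions for illustration; it is a stand-in, not necessarily the technique being discussed in this thread.

```python
import html
import re

# Deliberately simple pattern of the kind a grep-style scraper might use.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def naive_scrape(source: str) -> list[str]:
    """Search the raw source text only: no DOM, no JavaScript, no rendering."""
    return EMAIL_RE.findall(source)


def entity_encode(text: str) -> str:
    """Hypothetical scrambling: turn every character into a numeric HTML entity."""
    return "".join(f"&#{ord(ch)};" for ch in text)


# Hypothetical sample pages, not taken from any real site.
plain_page = '<a href="mailto:alice@example.com">Contact</a>'
scrambled_page = f"<p>Contact: {entity_encode('alice@example.com')}</p>"

found_plain = naive_scrape(plain_page)          # address sits in the raw text
found_scrambled = naive_scrape(scrambled_page)  # no literal "@" in the raw text
# One cheap extra step (decoding entities) is enough to undo this scrambling.
found_decoded = naive_scrape(html.unescape(scrambled_page))

print(found_plain, found_scrambled, found_decoded)
```

The raw-text search misses the scrambled address, but a single decoding pass brings it back, and a tool that renders the page would see it regardless. That is the cat-and-mouse shape: each defense only stops the scrapers that have not yet added the matching step.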

If they’re saying it, I think that they’re wrong. One of those naively written scrapers won’t pick up an email address ‘protected’ in this way. It’s simply continuing the game of cat and mouse.