HN Mirror


by shanehoban 766 days ago

Try querying it, though, with document.querySelectorAll('a') for example. It's a good first line of defense, since a lot of scraping techniques take this approach.

However, if you have a headless browser set up for scraping and simply fetch the current URL while on the page[0], you get the page source as plain text, and a regex search over it will turn up the email address. That's a strange approach to take, I admit.

[0]: fetch('./').then((res) => res.text()).then((text) => console.log(text))
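
The regex step could look roughly like the sketch below. `extractEmails` and its pattern are illustrative names of my own, not something from the article, and the pattern is deliberately loose rather than RFC-complete.

```javascript
// Hypothetical helper: pull email-looking strings out of raw page source.
// The pattern is intentionally simple and will miss exotic addresses.
function extractEmails(text) {
  const pattern = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
  const matches = text.match(pattern) || [];
  // De-duplicate while keeping first-seen order.
  return [...new Set(matches)];
}

// In a browser console, combined with the fetch from [0]:
// fetch('./').then((res) => res.text()).then((text) => console.log(extractEmails(text)));
```

This only works if the address appears literally somewhere in the fetched source; anything assembled at render time would not show up in the text.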

2 comments

nolok 766 days ago

> It's a good first line of defense as a lot of scraping techniques do this approach.

Most basic scrapers (the ones that aren't for your testing, devtools, automation, and so on) actually use the basic text, without any interpretation. They grep the source code; they don't run a DOM and JavaScript engine, because that makes a major difference in computing needs and speed.

I'm not saying there are no evil scrapers doing DOM evaluation; there are tons. I'm reacting to your "FIRST line of defense": the first line is scrambling the raw text, which is how we got here.

What the parent is saying is that this tries to upgrade the defense we built to stop the threat as it evolved, but it forgets why we got there and so leaves itself vulnerable to the original threat.
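
To make the raw-text scrambling concrete, here is a minimal sketch, assuming the scrambling is done with numeric HTML entities (one common form; the thread doesn't say which one is meant). All function names are hypothetical.

```javascript
// Hypothetical illustration of "scrambling the raw text": encode every
// character of the address as a numeric HTML entity.
function toEntities(email) {
  return Array.from(email, (ch) => `&#${ch.codePointAt(0)};`).join('');
}

// What a grep-style scraper effectively does: a regex over the raw source.
function grepEmails(source) {
  return source.match(/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g) || [];
}

// Decoding numeric entities is cheap, so this stops only the naive pass.
function decodeEntities(source) {
  return source.replace(/&#(\d+);/g, (_, code) => String.fromCodePoint(Number(code)));
}

const scrambled = toEntities('jane@example.com');
console.log(grepEmails(scrambled));                 // nothing found in the raw source
console.log(grepEmails(decodeEntities(scrambled))); // address recovered after decoding
```

The browser still renders the entities as a normal address, but a scraper that greps the source without decoding sees no `@` at all.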


animuchan 766 days ago

Absolutely. The basic tools just fetch sites recursively and use regular expressions. The advanced tools are Chromium-based, so they will render SVGs just fine (and then potentially run OCR / AI to extract text even from JPEGs).

This technique protects against a "neither here nor there" subset of programs. I wonder how large that set is in practice.


cqqxo4zV46cp 766 days ago

If they’re saying it, I think that they’re wrong. One of those naively written scrapers won’t pick up an email address ‘protected’ in this way. It’s simply continuing the game of cat and mouse.


nkozyra 766 days ago

You can just query for all the image elements and then read any SVG using the document model.

This is trivial to overcome for most basic scrapers and not much harder even if you try to obfuscate with paths for more sophisticated ones.
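
A rough sketch of the document-model route, assuming the SVG is inline in the page and keeps its content in `<text>` elements. `collectSvgText` is a hypothetical name. An SVG referenced from an `<img>` would have to be fetched and parsed separately, and text converted to paths is not handled here.

```javascript
// Hypothetical helper: collect the text of every <text> node inside inline SVGs.
// `root` is anything exposing querySelectorAll, e.g. `document` in a browser.
function collectSvgText(root) {
  return Array.from(root.querySelectorAll('svg text'), (node) =>
    node.textContent.trim()
  ).filter((text) => text.length > 0);
}

// In a browser console: collectSvgText(document)
```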
