by shanehoban
766 days ago
Try querying it, though, via document.querySelectorAll('a') for example. It's a good first line of defense, since a lot of scraping techniques take this approach. However, if you have a headless browser set up for scraping and simply fetch the current URL while on the page [0], you get the plain text, and a regex search over it will find the email address. I admit this is a strange approach to take.

[0]: fetch('./').then((res) => res.text()).then((text) => console.log(text))
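To make the fetch-then-regex idea concrete, here is a minimal sketch. The helper name `extractEmails` and the pattern are assumptions for illustration; the pattern is deliberately loose and is not a full address validator.

```javascript
// Hypothetical helper: pull email-like strings out of raw page text.
// The pattern is an assumption: loose on purpose, not an RFC 5322 matcher.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

function extractEmails(text) {
  // String.prototype.match returns null when nothing matches,
  // so fall back to an empty array.
  const matches = text.match(EMAIL_PATTERN) || [];
  // De-duplicate while preserving first-seen order.
  return [...new Set(matches)];
}
```

On a live page this would be fed by the fetch from [0], for example `fetch('./').then((res) => res.text()).then((text) => console.log(extractEmails(text)))`.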
|
Most basic scrapers, the ones not built for testing, devtools, or automation, actually work on the raw text without any interpretation. They grep the source code; they don't run a DOM and a JavaScript engine, because that makes a major difference in computing needs and speed.
I am not saying there are no evil scrapers doing DOM evaluation; there are tons. I am reacting to your "FIRST line of defense": the first line is scrambling the raw text, which is how we got here.
What the parent is saying is that this technique tries to upgrade the defense we built against the threat as it evolved, but it forgets why we got here and thus leaves itself vulnerable to the original threat.
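A rough sketch of that original threat, a scraper that only greps the source, and of why scrambling the raw text stops it. Everything here is hypothetical: `grepEmails`, `entityEncode`, and the sample markup are illustrations, and HTML-entity encoding is just one possible way to scramble the text. A scraper that decodes entities or evaluates the DOM would still recover the address.

```javascript
// Naive source-grepping scraper: no DOM, no JavaScript engine, just a regex.
const NAIVE_EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

function grepEmails(source) {
  // match() returns null when nothing matches.
  return source.match(NAIVE_EMAIL_RE) || [];
}

// Hypothetical scrambling: write every character as a decimal HTML entity.
// A browser renders it normally, but the literal "@" never appears in the source.
function entityEncode(s) {
  return Array.from(s, (ch) => `&#${ch.codePointAt(0)};`).join('');
}

const plainSource = '<a href="mailto:jane@example.com">Contact</a>';
const scrambledSource =
  `<a href="mailto:${entityEncode('jane@example.com')}">Contact</a>`;

console.log(grepEmails(plainSource));     // finds the plain address
console.log(grepEmails(scrambledSource)); // finds nothing in the scrambled source
```

A defense that only hides the address from DOM queries but leaves it readable in the source corresponds to `plainSource` here, which is the case the naive grep still catches.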