> With the right combination of proxies, user agents and browsers, you can scrape every website, even those that seem unscrapable.

> This outcome was great news for web scrapers, as it means that so long as a website has made its data public, you are not in violation of the CFAA when you scrape the data, even if it is prohibited in some other way (T&Cs, robots.txt, etc.).

Just because you can doesn't mean you should. I think it would be better if there were a treatment of the ethics here, rather than a seemingly "ra-ra go bots" attitude, as though the only consideration is commercial.
- If they provide an API, then use it.
- Don't slam a website; ideally, spread your requests out over the hours of the day when their target audience is least active (night time).
- If you can get cached data from somewhere and it meets your needs, then use that.
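As a rough illustration of the points above, here is a minimal Python sketch of a "polite" fetch loop. It is only one way to do it, under stated assumptions: the user agent string, the 01:00-06:00 quiet window, and the helper names (`is_off_peak`, `min_delay`, `allowed_by_robots`, `polite_fetch`) are all made up for the example, and honoring robots.txt is an extra courtesy I've added rather than something from the list.

```python
import time
import urllib.request
from urllib import robotparser

USER_AGENT = "example-polite-bot/0.1"  # hypothetical identifier, pick your own


def is_off_peak(hour, start=1, end=6):
    """True if `hour` (0-23, in the site's local time) is in the assumed quiet window."""
    return start <= hour < end


def min_delay(n_requests, window_seconds):
    """Seconds to wait between requests so they are spread evenly over the window."""
    if n_requests <= 0:
        raise ValueError("n_requests must be positive")
    return window_seconds / n_requests


def allowed_by_robots(robots_txt, url, user_agent=USER_AGENT):
    """Check a URL against the text of a robots.txt file (already downloaded)."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def http_get(url):
    """Plain GET that identifies the bot via the User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def polite_fetch(urls, robots_txt, delay_seconds, cache=None,
                 fetch=http_get, sleep=time.sleep):
    """Fetch URLs one at a time: use the cache first, skip disallowed paths,
    and pause `delay_seconds` between real network requests."""
    cache = {} if cache is None else cache
    results = {}
    fetched_any = False
    for url in urls:
        if url in cache:                      # cached copy: no request at all
            results[url] = cache[url]
            continue
        if not allowed_by_robots(robots_txt, url):
            continue                          # site asked bots to stay out
        if fetched_any:
            sleep(delay_seconds)              # spread the load out
        cache[url] = results[url] = fetch(url)
        fetched_any = True
    return results
```

For example, spreading 3,600 pages over a six-hour overnight window with `min_delay(3600, 6 * 3600)` gives one request every 6 seconds, which most sites will barely notice. The `fetch` and `sleep` parameters are injectable only so the loop can be tried out without touching a real site.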
Most developers are respectful and only scrape what they really need, not only from an ethical point of view but also from a cost and resources point of view. Scraping data is resource intensive, and proxy costs can quickly rise to $1,000-$10,000 per month, so most only scrape the minimum they need.
The other thing here is that a lot of the most popular sites being scraped are also massive scrapers themselves. The big ecommerce sites are being scraped, but they are scraping their competitors too.