Do you honestly believe all site scraper people/companies are ethical enough to go to whoever pays /them/ to scrape data from a competitor's site and say "oh they offer an API to access this data let's pay for that", instead of "why pay for that data when we can scrape it right off their site"?
Also, not all types of company will provide API endpoints. It all depends on the type of site - for example, an online shop might not wish to provide easily accessible data on offered products and prices, to their competitors who may wish to undercut them. Why would an online shop do that?
I run a large scraper farm against several large sites. They're not online shops, and we don't compete with them. But they do have hundreds of thousands of data points that we use to provide reports and analytics for our clients, who also do not compete with the sites.
I absolutely would pay for an API that provides that data. I'd be willing to pay 10x more than the cost of maintaining and running the scrapers.
But the sites being scraped have no interest in that.
Have you tried approaching those sites and asking them to provide an API, pointing out that it would be easier for both of you in the long run? Or are you just assuming they wouldn't do it?
Because right now, the bots - which comprise probably 2/3 of my traffic - are causing me huge headaches, and I sure wish that the people doing it would tell me what the heck they want.
Building and maintaining the scraper is not the cost they would use to measure it internally. It’s the cost to build the API and support it, plus perhaps any perverse incentive it creates where even more data flows out to competitors.
For all intents and purposes, this isn't competitive data for them. There aren't really competitors in the space anyway, the barrier to entry is ridiculous. In fact, by law, operators in the industry are required to share this particular data with each other and industry regulators. But they don't share it with outside parties in the aggregate form we need it in. Hence, the scraping.
Well, you don't need an api, just a CSV file with a catalog.
The scraping company WILL use the API/CSV file... they will probably also still charge their customer for scraping, so it's a win-win :D
You can think of it this way, the prices and product data are publicly visible already on the website, there are no real secrets, none of it is password protected.
You can be principled and insist on blocking bots and spend a lot of time and money on tools, people, and ultimately hosting because the bots will always win; or you can offer the data for free/minimal fee and serve it with almost zero cost and cache it so you can do that with a micro sized server.
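To make the cheap option concrete, here is a minimal sketch of what I mean, not anyone's actual setup. It assumes the catalog fits in memory and that a plain CSV dump is acceptable; the field names (`sku`, `name`, `price`) and the helpers `build_catalog_csv`, `etag_for` and `respond` are all made up for illustration. The idea is to render the catalog once, tag it with an ETag, and answer repeat fetches with a 304 so well-behaved clients cost almost nothing.

```python
import csv
import hashlib
import io


def build_catalog_csv(products):
    """Render the product list as CSV text: a header row plus one row per product."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["sku", "name", "price"])  # hypothetical column names
    for p in products:
        writer.writerow([p["sku"], p["name"], f'{p["price"]:.2f}'])
    return buf.getvalue()


def etag_for(body):
    """Derive a quoted ETag value from a hash of the CSV body."""
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()[:16]
    return f'"{digest}"'


def respond(body, if_none_match=None):
    """Return (status, headers, payload) for a conditional GET of the catalog."""
    tag = etag_for(body)
    headers = {"ETag": tag, "Cache-Control": "public, max-age=3600"}
    if if_none_match == tag:
        # The client already holds this exact version: send no body.
        return 304, headers, ""
    headers["Content-Type"] = "text/csv; charset=utf-8"
    return 200, headers, body


# Made-up sample data, only to show the shape of the output.
products = [
    {"sku": "A-1", "name": "Widget", "price": 9.5},
    {"sku": "B-2", "name": "Gadget, large", "price": 24.0},
]
catalog = build_catalog_csv(products)
status, headers, payload = respond(catalog)
```

In practice you would probably just write the file to disk and let a CDN or the web server handle the conditional requests; the point is only that the whole thing is a static file.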
You can always lie about some of the prices if you want, but you will just encourage bots again.
Ethics are nice, but let's be honest, very lacking. Sometimes it's better to be pragmatic.
> You can think of it this way, the prices and product data are publicly visible already on the website, there are no real secrets, none of it is password protected.
There's the problem right there. The prices and product data are publicly visible - because there is a target audience of /humans/ for whom the site is designed and intended. The site is not there to cater for a competitor's scrapers.
I don't care how much people couch their unethical behaviour in "the data is publicly available", the basic fact is most if not all websites exist for human eyeballs to look at them. They do not exist for arseholes to DOS them by inundating them with scrapers.
From my perspective, the problem is that the data that is offered isn't really "for humans". The data is for convincing the humans to buy/pay or worse, browse and watch ads as a result.
But overall, information is one of those goods that has intrinsic properties like no other. It can be copied, infinitely. And we haven't yet figured out the dynamics of how to reason about it, so it feels like we're pretending they're physical goods.
Edit. Side note. I'd go further and say that some of the data is even worse, it's "offered" with the real intention being to confuse the users into performing non-optimally in the market. Look at Amazon/Ebay/AliExpress/Google listings for evidence of that. Just look at Google - Google is an ML and scraping powerhouse, and the best they can muster is to be spammed with fake websites and duplicate/confusing listings.
You hit the nail on the head. It's hard to have sympathy for site operators complaining about scraping, when almost every site does its best[0] to make using it a time consuming, potentially risky and overall annoying ordeal. Not to mention, information asymmetry is anathema to a well-functioning market, and yet no. 1 reason for fighting bots given in the whole thread here is a desire to maintain that information asymmetry.
And that's also the dirty secret behind the "attention economy": its whole point is to make things as inefficient as possible, because if you're making money on people's attention, you need to first steal it (by distracting them from what they're trying to achieve), and then either direct it towards your goals (vs. those of the users), or stretch it out to maximize their exposure to advertising.
--
[0] - Sometimes unintentionally. Unfortunately, the overall zeitgeist of UX design is heavily influenced by bad players, so default advice in the industry is often already intrinsically user-hostile.
> Not to mention, information asymmetry is anathema to a well-functioning market, and yet no. 1 reason for fighting bots given in the whole thread here is a desire to maintain that information asymmetry.
> the basic fact is most if not all websites exist for human eyeballs to look at them.
There's a whole ethical subthread here of websites trying to make the experience for those humans miserable, and taking away the agency necessary to protect oneself from that. A browser is a user agent. So is a screen reader. So is a script one writes to not deal with bullshit fluff, when all one wants is a simple table of products, features and prices.
I agree 100%, but it is a fact of life, and sometimes it's better to just minimize the fuss and focus on the things that matter.
Your argument is perfectly valid and applies to offline activities as well (what stops a competitor from walking through the aisles of a Walmart or Costco?), but this is a battle that can't be won, there are too many parasitic actors. It is human nature.
Understanding your competitor's pricing is not "parasitic", it's research. Every company I've ever worked for that sells something online scrapes their competitors in some way (whether with bots or with interns).
Let's not encourage these unethical people to even think of using human eyeballs and manual data entry for their scraping instead of bots. That sounds pretty darn unethical.
My point was more that we can accept, and live with, scrapers but expect some minimal level of consideration if you’re going to abuse our very expensively gathered dataset. Sending us 10x daily traffic so you can scrape quicker than the fair usage policy of our API allows is just… poor etiquette? Unkind? Not really sure how to phrase it. I’m exhausted after multiple 18-hour days trying to keep our website online for the public.
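For anyone on the scraping side wondering what "minimal consideration" could look like in code, here is a rough sketch and nothing more. It assumes the site publishes some requests-per-minute limit in its fair usage policy; the class name `PoliteThrottle` and the `max_per_minute` parameter are hypothetical, not taken from any real policy or library. It simply enforces a minimum gap between consecutive requests.

```python
import time


class PoliteThrottle:
    """Enforce a minimum gap between requests (hypothetical helper)."""

    def __init__(self, max_per_minute, clock=time.monotonic, sleep=time.sleep):
        # Smallest allowed spacing between two requests, in seconds.
        self.min_gap = 60.0 / max_per_minute
        self.clock = clock    # injectable so the logic can be tested without waiting
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next request may go out

    def wait(self):
        """Block until the next request is allowed; return the seconds slept."""
        now = self.clock()
        delay = 0.0
        if self.next_allowed is not None and now < self.next_allowed:
            delay = self.next_allowed - now
            self.sleep(delay)
        # Schedule the following request one gap after this one goes out.
        self.next_allowed = now + delay + self.min_gap
        return delay
```

Call `throttle.wait()` before each request, with `max_per_minute` set at or below whatever the policy states, and the 10x traffic spike never happens in the first place.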