| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by r_singh 273 days ago

It’s fair to be angry at abuse and "aggressive bots", but it's important to remember most large platforms—including the ones being scraped—built their own products on scraping too.

I run an e-commerce-specific scraping API that helps developers access SERP, PDP, and reviews data. I've noticed the web already has unsaid balances: certain traffic patterns and techniques are tolerated, others clearly aren’t. Most sites handle reasonable, well-behaved crawlers just fine.

Platforms claim ownership of UGC and public data through dark patterns and narrative control. The current guidelines are a result of supplier convenience, and there are several cases where absolutely fundamental web services run by the largest companies in the world themselves breach those guidelines (including those funded by the fund running this site). We need standards that treat public data as a shared resource with predictable, ethical access for everyone, not just for those with scale or lobbying power.

1 comments

karlshea 273 days ago

If you’re running a well-behaved crawler (for example one that respects nofollow, and doesn’t try every single product filter combination it can find) then fine. If you don’t, then I don’t have any sympathy for the consequences that your niche of the industry caused.

Not everyone has the budget for unlimited bandwidth and compute, and in several of my clients’ cases that’s been >95% of all traffic.

People running these bots with AI/VC capital are just script kiddies that forgot that not every site is a boatload of app servers behind Cloudflare.

link

r_singh 273 days ago

My service only extracts public data major retailers, not indie sites, and deducts more credits for lower-traffic domains to offset load differences.

It would be great if there were reliable ways to distinguish good bots from bad ones — many actually improve discoverability and sales. I see this with affiliate shopping sites that depend on e-commerce data, though that impact is hard to trace directly.

The bad actors are the ones cloning sites or using data for manipulation and propaganda.

link