by r_singh 227 days ago
The Internet isn’t possible without scraping. For all the sentiment against scraping public data, doing so remains legal and essential to a lot of the services we use every day. I think the right thing to do is to set guidelines and shape the web for low-friction fair use, rather than turning this into a political issue.

There were already guidelines; these trash people aren’t following them. That’s why there’s now “sentiment” against them.
It’s fair to be angry at abuse and "aggressive bots", but it's important to remember most large platforms—including the ones being scraped—built their own products on scraping too.

I run an e-commerce-specific scraping API that helps developers access SERP, PDP, and reviews data. I've noticed the web already has unsaid balances: certain traffic patterns and techniques are tolerated, others clearly aren’t. Most sites handle reasonable, well-behaved crawlers just fine.

Platforms claim ownership of UGC and public data through dark patterns and narrative control. The current guidelines are a result of supplier convenience, and there are several cases where absolutely fundamental web services run by the largest companies in the world themselves breach those guidelines (including those funded by the fund running this site). We need standards that treat public data as a shared resource with predictable, ethical access for everyone, not just for those with scale or lobbying power.

If you’re running a well-behaved crawler (for example one that respects nofollow, and doesn’t try every single product filter combination it can find) then fine. If you aren’t, then I don’t have any sympathy for the consequences that your niche of the industry caused.
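To make "well-behaved" concrete, here is a minimal sketch of the two behaviours mentioned above: skipping links marked rel="nofollow", and skipping URLs that look like stacked product-filter combinations. The `MAX_QUERY_PARAMS` threshold and the names `LinkCollector` and `crawlable_links` are assumptions made for illustration, not anything a particular crawler or site actually uses.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qsl, urljoin, urlparse

# Assumption: a URL with more than one query parameter is probably a
# filter combination (colour + size + brand ...), so we do not enqueue it.
MAX_QUERY_PARAMS = 1


class LinkCollector(HTMLParser):
    """Collects links that a polite crawler might follow (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # rel is a space-separated token list; honour "nofollow" if present.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            return
        url = urljoin(self.base_url, href)
        params = parse_qsl(urlparse(url).query, keep_blank_values=True)
        if len(params) > MAX_QUERY_PARAMS:
            return
        self.links.append(url)


def crawlable_links(html, base_url):
    """Return the absolute URLs in `html` that pass both checks."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

A real crawler would also need rate limiting and caching; this only covers link selection.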

Not everyone has the budget for unlimited bandwidth and compute, and in several of my clients’ cases bot traffic has been >95% of all traffic.

People running these bots with AI/VC capital are just script kiddies that forgot that not every site is a boatload of app servers behind Cloudflare.

My service only extracts public data from major retailers, not indie sites, and deducts more credits for lower-traffic domains to offset load differences.

It would be great if there were reliable ways to distinguish good bots from bad ones — many actually improve discoverability and sales. I see this with affiliate shopping sites that depend on e-commerce data, though that impact is hard to trace directly.

The bad actors are the ones cloning sites or using data for manipulation and propaganda.

Well sure, but these guidelines exist: robots.txt has been an industry-led, self-governing, self-restrictive standard. But newer bots ignore it. It'll take years for legislation to catch up, and even then it would be by country or region, not something global, because that's not how the internet works.
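For reference, honouring robots.txt takes very little code. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the `example-bot` user agent in the usage note, and the helper names `build_policy` and `may_fetch` are made up for illustration. It parses an in-memory file so it runs without network access, whereas a real crawler would fetch each site's own /robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary shop; not taken from any real site.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Crawl-delay: 5
"""


def build_policy(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser, user_agent, url):
    """True if the parsed rules allow `user_agent` to request `url`."""
    return parser.can_fetch(user_agent, url)
```

With this policy, `may_fetch(policy, "example-bot", ...)` allows a product page but refuses anything under /search, and `policy.crawl_delay("example-bot")` reports the requested delay in seconds. Nothing enforces any of it, which is the point being made above: compliance is voluntary.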

Even if there is legislation or whatever, you can sue an OpenAI or a Microsoft, but starting a new company that does scraping and sells it on to the highest bidder is trivial.

As the legal history around scraping shows, it’s almost always the smaller company that gets sued out of existence. Taking on OpenAI or Microsoft, as you suggest, isn’t realistic — even governments often struggle to hold them accountable.

And for the record, large companies regularly ignore robots.txt themselves: LinkedIn, Google, OpenAI, and plenty of others.

The reality is that it’s the big players who behave like the aggressors, shaping the rules and breaking them when convenient. Smaller developers aren’t the problem, they’re just easier to punish.

What? What do you mean?
As posted in another comment, they run a scraping API. I think their opinion is at least slightly biased.
To be fair the heyday of unshit search was driven by mostly-consensual scraping.

Today there are far too many people scraping stuff that isn't intended to be scraped, for profit, and doing it in a heavy-handed way that actually does have a negative and continuous effect on the victim's capacity.

That covers everyone from AI services too lazy or otherwise unwilling to cache, to companies exfiltrating some kind of data for their own commercial purposes.

With peering bandwidth freely distributed to ISPs, and consumers fed media and subsidised services up to their necks, the counterargument smells of narrative control rather than technical or financial constraints.

But as I’m growing older, I’m learning that the tech industry is mostly politically driven and relies on truth obfuscation, as explained by Peter Thiel, rather than real empowerment.

It’s facilitating accumulation of control and power at an unparalleled pace. If anything it’s proving to be more unjust than the feudal systems it promises to replace.

I may have been too harsh. I love capitalism, technology, and software—they’ve built a meritocratic world and given me the tools to build my own life.

AI and technology feel like my best friend, but also my worst enemy when they edge toward learned helplessness. That tension exists with anything we depend on: the closer we get, the more power it holds.

The relationship between user and technology is becoming deeply intimate as systems gain reach and control. It’s important to stay optimistic but skeptical—and to keep protesting everything—because the work is moving faster than our ability to register its consequences.

Reading back, I realise I drifted into more of a monologue than a conversation. I get carried away when I’m trying to reason things out in public. Still, I stand by the core point about balance and transparency in how we shape the web.